Abstract

Distributed source coding is traditionally viewed in the block coding context — all the source symbols are known 
in advance at the encoders. This paper instead considers a streaming setting in which iid source symbol pairs are 
revealed to the separate encoders in real time and need to be reconstructed at the decoder with some tolerable end-to- 
end delay using finite rate noiseless channels. A sequential random binning argument is used to derive a lower bound 
on the error exponent with delay and show that both ML decoding and universal decoding achieve the same positive 
error exponents inside the traditional Slepian-Wolf rate region. The error events are different from the block-coding 
error events and give rise to slightly different exponents. Because the sequential random binning scheme is also 
universal over delays, the resulting code eventually reconstructs every source symbol correctly with probability 1.

Fig. 1. Slepian-Wolf distributed encoding and joint decoding of a pair of correlated sources.

I. Introduction

Traditionally, "lossless" coding is considered using two distinct paradigms: fixed block coding and
variable-length coding.² As classically understood, both assume that the source symbols are known in advance at the
encoder and that they must be mapped into a string of bits decoded by the receiver. Fixed-block coding accepts a
small probability of error and constrains the length of the bit-string, while variable-length encoding constrains only
the expected length of the bit-string in exchange for keeping the probability of error at zero. In the point-to-point
setting, both paradigms apply generically. In contrast, distributed source coding has traditionally been explored
within the fixed block context. In [1], Slepian and Wolf even asked:

What is the theory of variable-length encodings for correlated sources? 

In the classical context of source realizations known entirely in advance, the answer is simple: there is no
nontrivial sense of variable-length encoding that applies generically while still being interesting.³ This is easiest to
see by example (illustrated in Figure 1 and revisited as Example 2 in Section IV). Suppose that the first encoder
observes the random vector $\mathbf{x}$, which consists of a sequence of $N$ iid uniform binary random variables. Suppose
further that the second encoder observes $\mathbf{y}$, which is related to $\mathbf{x}$ via a memoryless binary symmetric channel with
crossover probability $p < 0.5$. The Slepian-Wolf sum-rate bound is $H(\mathsf{x},\mathsf{y}) = 1 + H(p) < 2 = H(\mathsf{x}) + H(\mathsf{y})$. But
since the individual encoders only see uniformly distributed binary sources, they do not know when the sources are
behaving jointly atypically. Therefore, they have no basis on which to adjust their encoding rates to combat joint
atypicality. Since all pairs are possible when finite blocklengths are considered, the individual encoders must use
distinct bit-strings for each of them. Since the expected length depends only on the uniform marginal distributions,
this means that the expected length must be at least $N$. Thus, variable-length approaches do not, in general,⁴ lead
to zero-error Slepian-Wolf codes for interesting rate-points.
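
The gap between the sum-rate bound and the rate needed for zero error can be made concrete with a short numerical sketch. It assumes base-2 logarithms and the binary symmetric example above; the function names `binary_entropy` and `slepian_wolf_sum_rate` and the value $p = 0.1$ are illustrative choices, not taken from the paper.

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def slepian_wolf_sum_rate(p: float) -> float:
    """Joint entropy H(x, y) = 1 + H(p) for uniform binary x observed through a BSC(p)."""
    return 1.0 + binary_entropy(p)

p = 0.1  # illustrative crossover probability (assumption)
print("sum-rate bound H(x,y):", slepian_wolf_sum_rate(p), "bits per symbol pair")
print("zero-error requirement H(x) + H(y):", 2.0, "bits per symbol pair")
```

For any $p < 0.5$ the first number is strictly below 2, which is the rate the encoders are forced to use if every pair of length-$N$ strings must be distinguished.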

Another view of variable-length coding is as a tool that enables us to achieve meaningful compression despite 
not knowing the underlying probability distribution⁵ and allowing the rate used to adapt to the source. If there is

1 This material was presented in part at the IEEE Int. Symp. Inform. Theory, Adelaide, Australia, Sept. 2005.

2 There are actually four different traditional cases: fixed to fixed, fixed to variable, variable to fixed, and variable to variable. However, the 
last three all achieve a probability of error of zero and so we consider them together. 

3 At least at sum rates close to the joint source entropy rate. If the rates of communication are high enough, e.g., equaling the log of the 
cardinalities of the source alphabets, zero-error communication is possible. 

4 One should note that, in analogy to zero-error channel coding, there are special (non-generic) cases where zero-error Slepian-Wolf coding 
is possible [2] since certain symbol pairs cannot occur. 

5 In the point-to-point case, this is very closely related to achieving a zero-error probability. The same string can be an atypical realization 
of one source model while being a typical realization of another source. Encoding all the typical sequences correctly without knowing the 
underlying model requires getting all the possible sequences correctly for any specific model. 



a low-rate, but reliable,⁶ feedback link available from the decoder to the two separate encoders, then this sense of
variable-length Slepian-Wolf coding is possible. A fixed-to-variable scheme is given in [5], in which the stopping time is
chosen at the decoder and communicated back to the encoders over a low-rate feedback link. The goal of [5] is
not a truly zero probability of error; rather, it accepts a very small probability of error in
exchange for using a rate that is as small as possible.

To answer the question posed by Slepian and Wolf in the more classical sense, we instead want to aim for 
a probability of error that goes to zero for every source symbol, but at the cost of a variable delay. To do this, 
we propose stepping back and eliminating the modeling assumption of encoders having access to the entire source 
realization in advance. We argue that a "streaming setting" is required to discern the system-level analog to variable- 
length source coding in the distributed context. The streaming setting abstracts sources that are embedded in time as 
well as the fact that all physically realizable encoders/decoders must obey some form of causality. Thus "rate" is not 
just measured in bits per source symbol but in both source symbols per second and bits per second. The source-rate 
(symbols per second) is specified as a part of the problem while the bit-rate (bits per second) is something that we 
get to choose. From an engineering perspective, three desirable qualities⁷ are:

• Using a low rate bit-pipe(s) 

• Low end-to-end latency 

• Low probability of error 

The theory of source coding should tell us the tradeoffs between these three desiderata. In addition, we are
interested in the extent to which a streaming code can be made "universal" over a class of probability distributions.

In the point-to-point streaming setting, regardless of whether block or variable-length compression is used, the 
traditional initial step is the same: group symbols into source blocks. To compress the data blocks, either use a 
fixed-rate block code, or a variable-length code. The resulting encoding is then enqueued for transmission across the 
bit-pipe. As long as the source entropy rate is below the data-rate, the queue will remain stable. When block coding 
is used for compression, there is a constant delay through the system, and atypical source blocks are received in 
error. The probability of error is fixed at the system's design-time and so is the end-to-end delay. 

In contrast, variable-length coding induces a variable system delay. The more unlikely the source blocks, the
longer the delay experienced at run-time. Thus, while asymptotically there are no errors when variable-length source
codes are used (assuming an infinite buffer size), the delay until a given symbol can be decoded depends on the
random source realization. Because atypical source realizations are large deviation events, the probability that some
source symbol cannot be reconstructed $\Delta$ samples after it enters the encoder decays exponentially⁸ in $\Delta$. The choice
of acceptable end-to-end delay is left to the receiver/application.

We show that this type of reliability can be achieved in a generic distributed coding context: the probability
of error goes to zero with end-to-end delay and the choice of the acceptable delay is entirely up to the decoder.
Essentially, every source symbol is recovered correctly eventually with probability⁹ 1. The only difference is that,
unlike the point-to-point case, the decoder does not necessarily know when the estimate for the symbol has converged
to its final value. Furthermore, just as in the point-to-point setting,¹⁰ both the encoding and decoding can be made
universal.

In this paper, we formally define a streaming Slepian-Wolf code, and develop coding strategies both for situations 
when source statistics are known and when they are not. The new tool is a sequential binning argument that parallels 
the tree-coding arguments used to study convolutional codes. We characterize the performance of the streaming 
schemes through an error exponent analysis and demonstrate that the exponents are equal regardless of whether the 
system is informed of the source statistics (in which case we use maximum likelihood decoding) or not (in which 
case we use universal decoding). The universal decoder we design for the streaming problem is somewhat different
from those familiar from the block coding literature, as is the nature of the error exponents.

6 It is clear that our techniques from [3], [4] can also be adapted to make the system of [5] work using only noisy feedback channels. 
7 Of course, "implementation complexity" forms a fourth and very important consideration, but we will be ignoring that aspect of the problem. 
8 In [6], we show that variable length codes used in this manner actually achieve the best possible error exponent with delay. This is also 
related to the analysis of [7]. 

9 The secret here is that we are considering a probability measure over infinite sequences. While all pairs of finite strings may be possible, 
most pairs of infinite strings collectively have probability zero. 

10 Sliding-window Lempel-Ziv compression is one example where data is naturally encoded sequentially. It is also universal over sources.

A. Potential applications and practical motivation

In addition to our core interest in answering some basic questions about Slepian-Wolf coding, our formulation is 
also motivated by the diverse emerging application areas for distributed source coding. Media (e.g., video-conference)
sources naturally have a streaming character. Consequently, we are motivated to explore what sort of streaming
Slepian-Wolf technique matches naturally to such situations.¹¹

B. Outline 

Section II summarizes the notation used in the paper. Section III reviews the classical block-coding error exponent
results for Slepian-Wolf source coding and then states the main results of this paper: sequential error exponents
for Slepian-Wolf source coding. Section IV presents a numerical study of two example sources. We observe that the
sequential error exponent is often the same as the block coding error exponent. Sections V, VI and VII prove the
theorems in Section III. We start with sequential source coding for single sources in Section V. This is the simplest case,
but it provides insight into the nature of the sequential source coding problem and of sequential error events. We show that
the sequential error exponent is the same as the random block source coding error exponent. Section VI moves on
to the case with decoder side information. Finally, Section VII presents the proof of the main result of the paper:
we derive the sequential error exponent of distributed source coding for correlated sources. This error exponent
is strictly positive everywhere inside the achievable rate region of [1]. For all three scenarios in Sections V, VI
and VII, both ML and universal decoding rules are studied. The appendix shows that the resulting error exponents
are indeed the same.

II. Notation 

We use serifed fonts, e.g., $x$, to indicate sample values, and sans-serif fonts, e.g., $\mathsf{x}$, to indicate random variables.
Bolded fonts are reserved to indicate sample or random vectors, e.g., $\mathbf{x} = x^n$ and $\boldsymbol{\mathsf{x}} = \mathsf{x}^n$, respectively, where
the vector length ($n$ here) is understood from the context. Subsequences, e.g., $x_l, x_{l+1}, \ldots, x_n$, are denoted as $x_l^n$,
where $x_l^n = \emptyset$ if $n < l$. Distributions are indicated with lower-case $p$, e.g., $\mathsf{x}$ is distributed according to $p_{\mathsf{x}}(x)$. Sets
and their elements are denoted as, e.g., $x \in \mathcal{X}$, and their cardinality by $|\mathcal{X}|$. We use calligraphic font to denote
sets, $\mathcal{X}$, $\mathcal{T}$, $\mathcal{W}$, etc., and reserve $\mathcal{E}$ and $\mathcal{D}$ to denote encoding and decoding functions, respectively. We use standard
notation for types; see, e.g., [8]. Let $N(a; \mathbf{x})$ denote the number of symbols in the length-$n$ vector $\mathbf{x}$ that take on
value $a$. Then, $\mathbf{x}$ is of type $P$ if $P(a) = N(a; \mathbf{x})/n$. The type class, or set of length-$n$ vectors of type $P$, is denoted
$\mathcal{T}_P$. A sequence $\mathbf{y}$ has conditional type $V$ given $\mathbf{x}$ if $N(a, b; \mathbf{x}, \mathbf{y}) = N(a; \mathbf{x}) V(b|a) = n P(a) V(b|a)$ for every $a, b$.
The set of sequences $\mathbf{y}$ having conditional type $V$ with respect to $\mathbf{x}$ is called the $V$-shell of $\mathbf{x}$ and is denoted by
$\mathcal{T}_V(\mathbf{x})$. When considered together, the pair $(\mathbf{x}, \mathbf{y})$ is said to have joint type $V \times P$. We always use upper-case,
e.g., $P$ and $V$, to denote length-$n$ types and conditional types. As we often discuss the types of subsequences, we
add a superscript to remind the reader of the length of the subsequence in question. If, for instance, the
subsequence under consideration is $x_l^n$, we write $x_l^n \in \mathcal{T}_{P^{n-l}}$. Similarly, we use $V^{n-l}$ for the conditional type of
length $(n - l + 1)$, and $V^{n-l} \times P^{n-l}$ for the joint type.

Given a joint type $V \times P$, entropies and conditional entropies are denoted as $H(P)$ and $H(V|P)$, respectively.
The KL divergence between two distributions $q$ and $p$ is denoted by $D(q\|p)$.
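
The type-related quantities above can be computed directly from their definitions. The following is a minimal sketch, assuming finite alphabets represented as Python strings or lists and base-2 logarithms; the helper names (`empirical_type`, `conditional_type`, `entropy`, `kl_divergence`) are hypothetical and introduced only for illustration.

```python
from collections import Counter
import math

def empirical_type(seq):
    """Type P of a sequence: P(a) = N(a; x) / n."""
    n = len(seq)
    return {a: count / n for a, count in Counter(seq).items()}

def conditional_type(x, y):
    """Conditional type V of y given x: V(b|a) = N(a, b; x, y) / N(a; x)."""
    joint_counts = Counter(zip(x, y))
    x_counts = Counter(x)
    return {(a, b): c / x_counts[a] for (a, b), c in joint_counts.items()}

def entropy(P):
    """Entropy H(P) in bits."""
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

def kl_divergence(q, p):
    """KL divergence D(q||p) in bits; assumes p[a] > 0 wherever q[a] > 0."""
    return sum(qa * math.log2(qa / p[a]) for a, qa in q.items() if qa > 0)

x = "aabab"
P = empirical_type(x)  # P(a) = 3/5, P(b) = 2/5
print(P, entropy(P), kl_divergence(P, {"a": 0.5, "b": 0.5}))
```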

III. Main Results 

In this section, we begin by reviewing classical results on the error exponents of distributed block coding. We 
then present the main results of the paper: error exponents for streaming Slepian-Wolf coding and its special cases: 
point-to-point coding and source coding with decoder side information. We analyze both maximum likelihood and 
universal decoding and show that the achieved exponents are equal. Leaving numerical examples and proofs for 
later sections, we here compare the form of the streaming exponents with their block coding counterparts. 

11 A secondary aspect in some multimedia settings is a natural multi-scale nature to the source — the high order bits are more important 
than the low order bits. To the extent that the high order bits can be made "early" and the low-order bits can be made "late", our constructions 
also naturally give more protection to the early bits as compared to the later ones. While this interpretation might eventually be important in 
practice, it is a bit questionable within the simplified model this paper considers.

A. Block source coding and error exponents

In the classic block-coding Slepian-Wolf paradigm, full length-$N$ vectors $\mathbf{x}$ and $\mathbf{y}$ are observed by their respective
encoders before communication commences. In this situation a rate-$(R_x, R_y)$ length-$N$ block source code consists
of an encoder-decoder triplet $(\mathcal{E}_N^x, \mathcal{E}_N^y, \mathcal{D}_N)$, as we will define shortly. For the rate-region considerations, the general
case of distributed encoders can be considered by using time-sharing among codes that alternate between sending
at rates close to the marginal entropy and those that correspond to perfectly known side-information. However, it
is easy to see that this results in a substantial loss of error exponent even in the block-coding case. To get good
exponents, something else is required:

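Definition 1: A randomized length-$N$ rate-$(R_x, R_y)$ block encoder-decoder triplet $(\mathcal{E}_N^x, \mathcal{E}_N^y, \mathcal{D}_N)$ is a set of maps

\[
\begin{aligned}
\mathcal{E}_N^x &: \mathcal{X}^N \to \{0,1\}^{N R_x}, & \text{e.g., } & \mathcal{E}_N^x(x^N) = a^{N R_x}, \\
\mathcal{E}_N^y &: \mathcal{Y}^N \to \{0,1\}^{N R_y}, & \text{e.g., } & \mathcal{E}_N^y(y^N) = b^{N R_y}, \\
\mathcal{D}_N &: \{0,1\}^{N R_x} \times \{0,1\}^{N R_y} \to \mathcal{X}^N \times \mathcal{Y}^N, & \text{e.g., } & \mathcal{D}_N(a^{N R_x}, b^{N R_y}) = (\hat{x}^N, \hat{y}^N),
\end{aligned}
\]

where common randomness, shared between the encoders and the decoder, is assumed. This allows us to randomize
the mappings independently of the source sequences.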

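The error probability typically considered in Slepian-Wolf coding is the joint error probability,
$\Pr[(\mathsf{x}^N, \mathsf{y}^N) \neq (\hat{\mathsf{x}}^N, \hat{\mathsf{y}}^N)] = \Pr[(\mathsf{x}^N, \mathsf{y}^N) \neq \mathcal{D}_N(\mathcal{E}_N^x(\mathsf{x}^N), \mathcal{E}_N^y(\mathsf{y}^N))]$. This probability is taken over the random source vectors
as well as the randomized mappings. An error exponent $E$ is said to be achievable if there exists a family of
rate-$(R_x, R_y)$ encoders and decoders $\{(\mathcal{E}_N^x, \mathcal{E}_N^y, \mathcal{D}_N)\}$, indexed by $N$, such that

\[
\lim_{N \to \infty} -\frac{1}{N} \log \Pr[(\mathsf{x}^N, \mathsf{y}^N) \neq (\hat{\mathsf{x}}^N, \hat{\mathsf{y}}^N)] \geq E. \tag{1}
\]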

In this paper, we study random source vectors $(\mathbf{x}, \mathbf{y})$ that are iid across time but may have dependencies at any
given time:

\[
p_{\mathsf{xy}}(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{N} p_{\mathsf{xy}}(x_i, y_i).
\]

For such iid sources, upper and lower bounds on the achievable error exponents are derived in [9], [8]. These 
results are summarized by the following theorem. 

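Theorem 1: (Lower bound) Let $(R_x, R_y)$ be a rate pair such that $R_x > H(\mathsf{x}|\mathsf{y})$, $R_y > H(\mathsf{y}|\mathsf{x})$, $R_x + R_y > H(\mathsf{x},\mathsf{y})$.
Then, for all

\[
E < \min_{\bar{\mathsf{x}}, \bar{\mathsf{y}}} D(p_{\bar{\mathsf{x}}\bar{\mathsf{y}}} \| p_{\mathsf{xy}}) + \left| \min\left[ R_x + R_y - H(\bar{\mathsf{x}}, \bar{\mathsf{y}}),\; R_x - H(\bar{\mathsf{x}}|\bar{\mathsf{y}}),\; R_y - H(\bar{\mathsf{y}}|\bar{\mathsf{x}}) \right] \right|^+ \tag{2}
\]

there exists a family of randomized encoder-decoder mappings as defined in Definition 1 such that (1) is satisfied.
In (2) the function $|z|^+ = z$ if $z \geq 0$ and $|z|^+ = 0$ if $z < 0$.

(Upper bound) Let $(R_x, R_y)$ be a rate pair such that $R_x > H(\mathsf{x}|\mathsf{y})$, $R_y > H(\mathsf{y}|\mathsf{x})$, $R_x + R_y > H(\mathsf{x},\mathsf{y})$. Then,
for all

\[
E > \min\left\{ \min_{\bar{\mathsf{x}}, \bar{\mathsf{y}} :\, R_x < H(\bar{\mathsf{x}}|\bar{\mathsf{y}})} D(p_{\bar{\mathsf{x}}\bar{\mathsf{y}}} \| p_{\mathsf{xy}}),\; \min_{\bar{\mathsf{x}}, \bar{\mathsf{y}} :\, R_y < H(\bar{\mathsf{y}}|\bar{\mathsf{x}})} D(p_{\bar{\mathsf{x}}\bar{\mathsf{y}}} \| p_{\mathsf{xy}}),\; \min_{\bar{\mathsf{x}}, \bar{\mathsf{y}} :\, R_x + R_y < H(\bar{\mathsf{x}}, \bar{\mathsf{y}})} D(p_{\bar{\mathsf{x}}\bar{\mathsf{y}}} \| p_{\mathsf{xy}}) \right\} \tag{3}
\]

there does not exist a randomized encoder-decoder mapping as defined in Definition 1 such that (1) is satisfied.
In both bounds $(\bar{\mathsf{x}}, \bar{\mathsf{y}})$ are dummy random variables with joint distribution $p_{\bar{\mathsf{x}}\bar{\mathsf{y}}}$.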

Remark: As long as $(R_x, R_y)$ is in the interior of the achievable region, i.e., $R_x > H(\mathsf{x}|\mathsf{y})$, $R_y > H(\mathsf{y}|\mathsf{x})$ and
$R_x + R_y > H(\mathsf{x},\mathsf{y})$, the lower bound (2) is positive. The achievable region is illustrated in Fig. 2. As shown
in [8], the upper and lower bounds (3) and (2) match when the rate pair $(R_x, R_y)$ is achievable and close to the
boundary of the region. This is analogous to the high-rate regime in channel coding, where the random coding
bound (analogous to (2)) and the sphere packing bound (analogous to (3)) agree.

Theorem 1 can also be used to generate bounds on the exponent for source coding with decoder side information
(i.e., $\mathsf{y}$ observed at the decoder), and for source coding without side information (i.e., $\mathsf{y}$ is a constant). These
corollaries will prove useful as a basis for comparison as we build up to the complete solution for streaming
Slepian-Wolf coding.

Corollary 1: (Source coding with decoder side information) Consider a Slepian-Wolf problem where $\mathsf{y}$ is known
by the decoder. Given a rate $R_x$ such that $R_x > H(\mathsf{x}|\mathsf{y})$, for all

\[
E < \min_{\bar{\mathsf{x}}, \bar{\mathsf{y}}} D(p_{\bar{\mathsf{x}}\bar{\mathsf{y}}} \| p_{\mathsf{xy}}) + \left| R_x - H(\bar{\mathsf{x}}|\bar{\mathsf{y}}) \right|^+ \tag{4}
\]

there exists a family of randomized encoder-decoder mappings as defined in Definition 1 such that (1) is satisfied.

The proof of Corollary 1 follows from Theorem 1 by letting $R_y$ be arbitrarily large. Similarly, by letting $\mathsf{y}$
be deterministic so that $H(\mathsf{x}|\mathsf{y}) = H(\mathsf{x})$ and $H(\mathsf{y}) = 0$, we get the following random-coding bound for the
point-to-point case of a single source $\mathsf{x}$.

Corollary 2: (Point-to-point) Consider a Slepian-Wolf problem where $\mathsf{y}$ is deterministic, i.e., $\mathsf{y} = y$. Given a rate
$R_x$ such that $R_x > H(\mathsf{x})$, for all

\[
E < \min_{\bar{\mathsf{x}}} D(p_{\bar{\mathsf{x}}} \| p_{\mathsf{x}}) + \left| R_x - H(\bar{\mathsf{x}}) \right|^+ = E_x(R_x) \tag{5}
\]

there exists a family of randomized encoder-decoder triplets as defined in Definition 1 such that (1) is satisfied.



[Figure 2. Horizontal axis ticks: $H(\mathsf{x}|\mathsf{y})$, $H(\mathsf{x})$, $\log|\mathcal{X}|$. Vertical axis ticks: $H(\mathsf{y}|\mathsf{x})$, $H(\mathsf{y})$, $\log|\mathcal{Y}|$.]

Fig. 2. Achievable region for Slepian-Wolf source coding

B. Sequential Distributed Source Coding

We now state our main results for streaming encoding, and contrast them with the block-coding results of the 
last section. To begin, we define a streaming encoder. 

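Definition 2: A randomized sequential encoder-decoder triplet $(\mathcal{E}^x, \mathcal{E}^y, \mathcal{D})$ is a sequence of mappings $\{\mathcal{E}_j^x\}$, $j = 1, 2, \ldots$,
$\{\mathcal{E}_j^y\}$, $j = 1, 2, \ldots$, and $\{\mathcal{D}_j\}$, $j = 1, 2, \ldots$:

\[
\begin{aligned}
\mathcal{E}_j^x &: \mathcal{X}^j \to \{0,1\}^{R_x}, & \text{e.g., } & \mathcal{E}_j^x(x^j) = a_{(j-1)R_x + 1}^{j R_x}, \\
\mathcal{E}_j^y &: \mathcal{Y}^j \to \{0,1\}^{R_y}, & \text{e.g., } & \mathcal{E}_j^y(y^j) = b_{(j-1)R_y + 1}^{j R_y}.
\end{aligned}
\]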

Common randomness, shared between encoders and decoder, is assumed. This allows us to randomize the mappings 
independently of the source sequence.

In this paper, the sequential encoding maps will always work by assigning random "parity bits" in a causal manner
to the observed source sequence. That is, the $R_x$ (or $R_y$) bits generated at each time are iid Bernoulli-(0.5).¹²
Since parity bits are assigned causally, if two source sequences share the same length-$l$ prefix, then their first $l R_x$
parity bits must match. Subsequent parities are drawn independently. Such a sequential coding strategy is the
source-coding parallel to tree and convolutional codes used for channel coding [10]. In fact, we call these "parity bits"
because they can be generated using an infinite constraint-length time-varying random convolutional code.
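
The causal prefix property can be illustrated with a small sketch. It is not the construction analyzed in the paper: as a stand-in for iid Bernoulli-(0.5) parity bits drawn with common randomness, it uses a keyed SHA-256 hash of the current prefix, and the names `parity_bits`, `sequential_encode` and the `seed` parameter are hypothetical.

```python
import hashlib

def parity_bits(prefix: str, rate: int, seed: bytes = b"shared-seed") -> str:
    """Return `rate` pseudo-random parity bits for the current source prefix.

    The seed plays the role of common randomness shared by encoder and decoder
    (an assumption of this sketch); rate must be at most 256.
    """
    if not 0 < rate <= 256:
        raise ValueError("rate must be between 1 and 256 bits per symbol")
    time_index = len(prefix).to_bytes(4, "big")
    digest = hashlib.sha256(seed + time_index + prefix.encode()).digest()
    return "".join(f"{byte:08b}" for byte in digest)[:rate]

def sequential_encode(source: str, rate: int, seed: bytes = b"shared-seed") -> str:
    """Concatenate the parity bits emitted at times j = 1, ..., len(source)."""
    return "".join(parity_bits(source[:j], rate, seed) for j in range(1, len(source) + 1))

a = sequential_encode("0110", rate=2)
b = sequential_encode("0101", rate=2)
# The sources share the length-2 prefix "01", so the first 2 * 2 parity bits agree.
print(a[:4] == b[:4])
```

Because the bits emitted at time $j$ depend only on the prefix seen so far, the decoder can recompute the bin of any candidate prefix from the shared seed.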

Definition 3: The decoder mapping

\[
\mathcal{D}_j : \{0,1\}^{j R_x} \times \{0,1\}^{j R_y} \to \mathcal{X}^j \times \mathcal{Y}^j.
\]

At each time $j$ the decoder $\mathcal{D}_j$ outputs estimates of all the source symbols that have entered the encoder by time $j$.

Remark: While we state Definition 2 only for Slepian-Wolf coding, it immediately specializes to source coding
with decoder side information (dropping $\mathcal{E}^y$ and revealing $y^n$ to the decoder), and source coding without side
information (dropping $\mathcal{E}^y$). We present results for both these situations as well.

In this paper we study two error probabilities. We define the pair of source estimates at time $n$ as
$(\hat{\mathsf{x}}^n, \hat{\mathsf{y}}^n) = \mathcal{D}_n\big(\prod_{j=1}^n \mathcal{E}_j^x, \prod_{j=1}^n \mathcal{E}_j^y\big)$, where $\prod_{j=1}^n \mathcal{E}_j^x$ indicates the full $n R_x$ bit stream from encoder $x$ up to time $n$. We use
$(\hat{\mathsf{x}}^{n-\Delta}, \hat{\mathsf{y}}^{n-\Delta})$ to indicate the first $n - \Delta$ symbols of each estimate, where for conciseness of notation both the
estimate time, $n$, and the decoding delay, $\Delta$, are indicated in the superscript. With these definitions the two error
probabilities we study are

\[
\Pr[\hat{\mathsf{x}}^{n-\Delta} \neq \mathsf{x}^{n-\Delta}] \quad \text{and} \quad \Pr[\hat{\mathsf{y}}^{n-\Delta} \neq \mathsf{y}^{n-\Delta}].
\]

A pair of exponents $E_x > 0$ and $E_y > 0$ is said to be achievable if there exists a family of rate-$(R_x, R_y)$ encoders
and decoders $\{(\mathcal{E}_j^x, \mathcal{E}_j^y, \mathcal{D}_j)\}$ such that

\[
\lim_{\Delta \to \infty} \lim_{n \to \infty} -\frac{1}{\Delta} \log \Pr[\hat{\mathsf{x}}^{n-\Delta} \neq \mathsf{x}^{n-\Delta}] \geq E_x, \tag{7}
\]

\[
\lim_{\Delta \to \infty} \lim_{n \to \infty} -\frac{1}{\Delta} \log \Pr[\hat{\mathsf{y}}^{n-\Delta} \neq \mathsf{y}^{n-\Delta}] \geq E_y. \tag{8}
\]

Remarks: In contrast to (1), the error exponent we look at is in the delay, $\Delta$, rather than the total observation time,
$n$. The order of the limits is important since the total time period $n$ is allowed to go to infinity faster than the delay
$\Delta$. While the definitions of (7)-(8) and of (1) are asymptotic in nature, the results hold for finite block lengths
and delays as well. Finally, we note that while in (1) the error exponent of a joint error event on either $\mathsf{x}$ or $\mathsf{y}$ is
considered, we provide a refined analysis specifying potentially different exponents on either decision. The results
for joint errors are found by taking the minimum of the individual exponents, i.e.,

\[
\lim_{\Delta \to \infty} \lim_{n \to \infty} -\frac{1}{\Delta} \log \Pr[(\hat{\mathsf{x}}^{n-\Delta}, \hat{\mathsf{y}}^{n-\Delta}) \neq (\mathsf{x}^{n-\Delta}, \mathsf{y}^{n-\Delta})] \geq \min\{E_x, E_y\}.
\]

C. Streaming source coding 

Our first results concern streaming coding in the point-to-point setting. The first theorem we state gives random 
coding error exponents for maximum likelihood decoding where the source statistics are known, and the second 
exponents for universal decoding, where they are not. 

12 We assume that $R_x$ and $R_y$ are integers. To justify this assumption, note that we can always group sets of $\alpha$ successive symbols into
super-symbols. These larger symbols can be encoded at an average rate $\alpha R_x$. Generally, if we group $\alpha$ symbols together and transmit $\beta$ bits
per super-symbol, we can realize an average rate $\beta/\alpha$ bits per source symbol, i.e., a rational rate. If desired, non-integer average rates are easily implemented by
a time-varying transmission rate. For example, say we want to implement an average encoding rate of 5/4 bits per source symbol. Say we
generate one new parity bit for each symbol observed, except for the fourth symbol, eighth symbol, etc., when we generate two. The
average encoding rate is 5/4. As long as the decoding delay $\Delta$ we target is long enough that the decoder receives an "average" number
of encoded bits ($\Delta R_x$) before it must make an estimate (e.g., if $\Delta \gg 1/R_x$), these small-scale issues even out. In particular, they do not
affect the exponents.
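
The rate-5/4 schedule described in the footnote can be written out in a few lines. This is only a sketch of that one example; the function names `bits_at_time` and `average_rate` are hypothetical.

```python
def bits_at_time(j: int) -> int:
    """Parity bits emitted when the j-th source symbol (j = 1, 2, ...) is observed."""
    return 2 if j % 4 == 0 else 1

def average_rate(n: int) -> float:
    """Average number of parity bits per source symbol over the first n symbols."""
    return sum(bits_at_time(j) for j in range(1, n + 1)) / n

print(average_rate(8))  # 10 bits over 8 symbols = 1.25
```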
Theorem 2: Given a rate $R_x > H(p_{\mathsf{x}})$, there exists a randomized streaming encoder and maximum likelihood
decoder pair (per Definition 2) such that for all $E < E_{ML}(R_x)$ there is a constant $K > 0$ such that
$\Pr[\hat{\mathsf{x}}^{n-\Delta} \neq \mathsf{x}^{n-\Delta}] \leq K \exp\{-\Delta E\}$ for all $n, \Delta \geq 0$, where

\[
E_{ML}(R_x) = \sup_{\rho \in [0,1]} \left\{ \rho R_x - (1 + \rho) \log \left[ \sum_x p_{\mathsf{x}}(x)^{\frac{1}{1+\rho}} \right] \right\}. \tag{9}
\]

Theorem 3: Given a rate $R_x > H(p_{\mathsf{x}})$, there exists a randomized streaming encoder and universal decoder pair
(per Definition 3) such that for all $E < E_{UN}(R_x)$ there is a constant $K > 0$ such that
$\Pr[\hat{\mathsf{x}}^{n-\Delta} \neq \mathsf{x}^{n-\Delta}] \leq K \exp\{-\Delta E\}$ for all $n, \Delta \geq 0$, where

\[
E_{UN}(R_x) = \inf_q D(q \| p_{\mathsf{x}}) + \left| R_x - H(q) \right|^+, \tag{10}
\]

where $q$ is an arbitrary probability distribution on $\mathcal{X}$ and where $|z|^+ = z$ if $z \geq 0$ and $|z|^+ = 0$ if $z < 0$.

Remark: The error exponents of Theorems 2 and 3 both equal their respective random block-coding exponents
for ML and universal decoders. For example, compare (10) with (5). The main difference in the formulation is that
the error probability decays with delay $\Delta$ rather than block length $N$. Furthermore, it is known that (9) and (10) are
equal; see [8], exercise 13 on page 44. Such equality is required by the formal definition of a universal scheme,
i.e., for the same source statistics and coding rates, the universal decoder should asymptotically achieve the same
error exponent as the maximum likelihood decoder. See [11] for a detailed discussion of universal versus maximum
likelihood decoding in the context of channel coding.
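
The equality of (9) and (10) can be checked numerically for a simple source. The sketch below assumes a Bernoulli-$p$ source, base-2 logarithms (rates in bits), and plain grid searches over $\rho$ and $q$; the function names and the values $p = 0.1$, $R_x = 0.7$ are illustrative choices, and the agreement is only up to the grid resolution.

```python
import math

def h2(q: float) -> float:
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def d2(q: float, p: float) -> float:
    """Binary KL divergence D(q||p) in bits, for 0 < p < 1."""
    total = 0.0
    if q > 0:
        total += q * math.log2(q / p)
    if q < 1:
        total += (1 - q) * math.log2((1 - q) / (1 - p))
    return total

def e_ml(rate: float, p: float, steps: int = 10_000) -> float:
    """Grid approximation of (9): sup over rho in [0, 1]."""
    best = 0.0  # the objective equals 0 at rho = 0
    for k in range(steps + 1):
        rho = k / steps
        s = p ** (1 / (1 + rho)) + (1 - p) ** (1 / (1 + rho))
        best = max(best, rho * rate - (1 + rho) * math.log2(s))
    return best

def e_un(rate: float, p: float, steps: int = 10_000) -> float:
    """Grid approximation of (10): inf over Bernoulli-q distributions."""
    return min(d2(k / steps, p) + max(rate - h2(k / steps), 0.0) for k in range(steps + 1))

p, rate = 0.1, 0.7  # illustrative source and rate, with rate > H(p)
print(e_ml(rate, p), e_un(rate, p))
```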

D. Streaming distributed source coding with decoder side information 

This section summarizes our results for distributed streaming source coding when the side information is observed 
at the decoder, but not the encoder: 

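Theorem 4: Given a rate $R_x > H(\mathsf{x}|\mathsf{y})$, there exists a randomized encoder-decoder pair (per Definition 2) such
that for all $E < E_{ML,SI}(R_x)$ there is a constant $K > 0$ such that $\Pr[\hat{\mathsf{x}}^{n-\Delta} \neq \mathsf{x}^{n-\Delta}] \leq K \exp\{-\Delta E\}$ for all
$n, \Delta \geq 0$, where

\[
E_{ML,SI}(R_x) = \sup_{\rho \in [0,1]} \left\{ \rho R_x - \log \left[ \sum_y \left[ \sum_x p_{\mathsf{xy}}(x, y)^{\frac{1}{1+\rho}} \right]^{1+\rho} \right] \right\}. \tag{11}
\]

Theorem 5: Given a rate $R_x > H(\mathsf{x}|\mathsf{y})$, there exists a randomized encoder-decoder pair (per Definition 2) such
that for all $E < E_{UN,SI}(R_x)$ there is a constant $K > 0$ such that $\Pr[\hat{\mathsf{x}}^{n-\Delta} \neq \mathsf{x}^{n-\Delta}] \leq K \exp\{-\Delta E\}$ for all
$n, \Delta \geq 0$, where

\[
E_{UN,SI}(R_x) = \inf_{\bar{\mathsf{x}}, \bar{\mathsf{y}}} D(p_{\bar{\mathsf{x}}\bar{\mathsf{y}}} \| p_{\mathsf{xy}}) + \left| R_x - H(\bar{\mathsf{x}}|\bar{\mathsf{y}}) \right|^+, \tag{12}
\]

and $(\bar{\mathsf{x}}, \bar{\mathsf{y}})$ are random variables with joint distribution $p_{\bar{\mathsf{x}}\bar{\mathsf{y}}}$, $H(\bar{\mathsf{x}}|\bar{\mathsf{y}})$ is their conditional entropy, and where $|z|^+ = z$
if $z \geq 0$ and $|z|^+ = 0$ if $z < 0$.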

Remark: Similar to the point-to-point case, the error exponents of Theorems 4 and 5 both equal their respective
random block-coding exponents. For example, compare (12) with (4). Similarly, (11) and (12) can be shown to be
equal.
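
As an illustrative check (a worked example added under the assumption of the binary symmetric source pair from the Introduction, i.e., $\mathsf{x}$ uniform on $\{0,1\}$ and $\mathsf{y}$ the output of a binary symmetric channel with crossover probability $p$), the exponent (11) can be evaluated in closed form. Here $p_{\mathsf{xy}}(x,y) = \tfrac{1}{2}(1-p)$ if $x = y$ and $\tfrac{1}{2}p$ otherwise, and $s_\rho$ is shorthand introduced only for this sketch:

```latex
\begin{aligned}
s_\rho &= p^{\frac{1}{1+\rho}} + (1-p)^{\frac{1}{1+\rho}}, \\
\sum_x p_{\mathsf{xy}}(x,y)^{\frac{1}{1+\rho}} &= 2^{-\frac{1}{1+\rho}}\, s_\rho \quad \text{for each } y \in \{0,1\}, \\
\sum_y \Big[ \sum_x p_{\mathsf{xy}}(x,y)^{\frac{1}{1+\rho}} \Big]^{1+\rho} &= 2 \cdot \tfrac{1}{2}\, s_\rho^{1+\rho} = s_\rho^{1+\rho}, \\
E_{ML,SI}(R_x) &= \sup_{\rho \in [0,1]} \big\{ \rho R_x - (1+\rho) \log s_\rho \big\}.
\end{aligned}
```

Under these assumptions the result is expression (9) evaluated for a Bernoulli-$p$ source, as one would expect when the decoder already knows $\mathsf{y}$ and only the noise sequence $\mathsf{x} \oplus \mathsf{y}$ remains to be described.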

E. Streaming Slepian-Wolf coding 

In contrast to streaming point-to-point coding and streaming source coding with decoder side information, the 
general case of streaming Slepian-Wolf coding with two distributed encoders results in error exponents that differ 
from their block coding counterparts. In the streaming setting, fundamentally different error events dominate as 
compared to the block setting. 

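Theorem 6: Let $(R_x, R_y)$ be a rate pair such that $R_x > H(\mathsf{x}|\mathsf{y})$, $R_y > H(\mathsf{y}|\mathsf{x})$, $R_x + R_y > H(\mathsf{x},\mathsf{y})$. Then,
there exists a randomized encoder pair and maximum likelihood decoder triplet (per Definition 2) that satisfies the
following three decoding criteria.

(i) For all $E < E_{ML,SW,x}(R_x, R_y)$, there is a constant $K > 0$ such that $\Pr[\hat{\mathsf{x}}^{n-\Delta} \neq \mathsf{x}^{n-\Delta}] \leq K \exp\{-\Delta E\}$
for all $n, \Delta \geq 0$, where

\[
E_{ML,SW,x}(R_x, R_y) = \min\left\{ \inf_{\gamma \in [0,1]} E_x^{ML}(R_x, R_y, \gamma),\; \inf_{\gamma \in [0,1]} \frac{1}{1-\gamma} E_y^{ML}(R_x, R_y, \gamma) \right\}.
\]

(ii) For all $E < E_{ML,SW,y}(R_x, R_y)$, there is a constant $K > 0$ such that $\Pr[\hat{\mathsf{y}}^{n-\Delta} \neq \mathsf{y}^{n-\Delta}] \leq K \exp\{-\Delta E\}$
for all $n, \Delta \geq 0$, where

\[
E_{ML,SW,y}(R_x, R_y) = \min\left\{ \inf_{\gamma \in [0,1]} \frac{1}{1-\gamma} E_x^{ML}(R_x, R_y, \gamma),\; \inf_{\gamma \in [0,1]} E_y^{ML}(R_x, R_y, \gamma) \right\}.
\]

(iii) For all $E < E_{ML,SW,xy}(R_x, R_y)$, there is a constant $K > 0$ such that
$\Pr[(\hat{\mathsf{x}}^{n-\Delta}, \hat{\mathsf{y}}^{n-\Delta}) \neq (\mathsf{x}^{n-\Delta}, \mathsf{y}^{n-\Delta})] \leq K \exp\{-\Delta E\}$ for all $n, \Delta \geq 0$, where

\[
E_{ML,SW,xy}(R_x, R_y) = \min\left\{ \inf_{\gamma \in [0,1]} E_x^{ML}(R_x, R_y, \gamma),\; \inf_{\gamma \in [0,1]} E_y^{ML}(R_x, R_y, \gamma) \right\}.
\]

In definitions (i)-(iii),

\[
\begin{aligned}
E_x^{ML}(R_x, R_y, \gamma) &= \sup_{\rho \in [0,1]} \left[ \gamma E_{x|y}(R_x, \rho) + (1 - \gamma) E_{xy}(R_x, R_y, \rho) \right], \\
E_y^{ML}(R_x, R_y, \gamma) &= \sup_{\rho \in [0,1]} \left[ \gamma E_{y|x}(R_y, \rho) + (1 - \gamma) E_{xy}(R_x, R_y, \rho) \right],
\end{aligned} \tag{13}
\]

and

\[
\begin{aligned}
E_{xy}(R_x, R_y, \rho) &= \rho (R_x + R_y) - \log \left[ \sum_{x,y} p_{\mathsf{xy}}(x, y)^{\frac{1}{1+\rho}} \right]^{1+\rho}, \\
E_{x|y}(R_x, \rho) &= \rho R_x - \log \left[ \sum_y \left[ \sum_x p_{\mathsf{xy}}(x, y)^{\frac{1}{1+\rho}} \right]^{1+\rho} \right], \\
E_{y|x}(R_y, \rho) &= \rho R_y - \log \left[ \sum_x \left[ \sum_y p_{\mathsf{xy}}(x, y)^{\frac{1}{1+\rho}} \right]^{1+\rho} \right].
\end{aligned} \tag{14}
\]
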
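Theorem 7: Let $(R_x, R_y)$ be a rate pair such that $R_x > H(\mathsf{x}|\mathsf{y})$, $R_y > H(\mathsf{y}|\mathsf{x})$, $R_x + R_y > H(\mathsf{x},\mathsf{y})$. Then,
there exists a randomized encoder pair and universal decoder triplet (per Definition 3) that satisfies the following
three decoding criteria.

(i) For all $E < E_{UN,SW,x}(R_x, R_y)$, there is a constant $K > 0$ such that $\Pr[\hat{\mathsf{x}}^{n-\Delta} \neq \mathsf{x}^{n-\Delta}] \leq K \exp\{-\Delta E\}$
for all $n, \Delta \geq 0$, where

\[
E_{UN,SW,x}(R_x, R_y) = \min\left\{ \inf_{\gamma \in [0,1]} E_x^{UN}(R_x, R_y, \gamma),\; \inf_{\gamma \in [0,1]} \frac{1}{1-\gamma} E_y^{UN}(R_x, R_y, \gamma) \right\}. \tag{15}
\]

(ii) For all $E < E_{UN,SW,y}(R_x, R_y)$, there is a constant $K > 0$ such that $\Pr[\hat{\mathsf{y}}^{n-\Delta} \neq \mathsf{y}^{n-\Delta}] \leq K \exp\{-\Delta E\}$
for all $n, \Delta \geq 0$, where

\[
E_{UN,SW,y}(R_x, R_y) = \min\left\{ \inf_{\gamma \in [0,1]} \frac{1}{1-\gamma} E_x^{UN}(R_x, R_y, \gamma),\; \inf_{\gamma \in [0,1]} E_y^{UN}(R_x, R_y, \gamma) \right\}. \tag{16}
\]

(iii) For all $E < E_{UN,SW,xy}(R_x, R_y)$, there is a constant $K > 0$ such that
$\Pr[(\hat{\mathsf{x}}^{n-\Delta}, \hat{\mathsf{y}}^{n-\Delta}) \neq (\mathsf{x}^{n-\Delta}, \mathsf{y}^{n-\Delta})] \leq K \exp\{-\Delta E\}$ for all $n, \Delta \geq 0$, where

\[
E_{UN,SW,xy}(R_x, R_y) = \min\left\{ \inf_{\gamma \in [0,1]} E_x^{UN}(R_x, R_y, \gamma),\; \inf_{\gamma \in [0,1]} E_y^{UN}(R_x, R_y, \gamma) \right\}.
\]

In definitions (i)-(iii),

\[
E_x^{UN}(R_x, R_y, \gamma) = \inf_{\tilde{\mathsf{x}}, \tilde{\mathsf{y}}, \bar{\mathsf{x}}, \bar{\mathsf{y}}} \gamma D(p_{\tilde{\mathsf{x}}\tilde{\mathsf{y}}} \| p_{\mathsf{xy}}) + (1 - \gamma) D(p_{\bar{\mathsf{x}}\bar{\mathsf{y}}} \| p_{\mathsf{xy}}) + \left| \gamma [R_x - H(\tilde{\mathsf{x}}|\tilde{\mathsf{y}})] + (1 - \gamma) [R_x + R_y - H(\bar{\mathsf{x}}, \bar{\mathsf{y}})] \right|^+, \tag{17}
\]

\[
E_y^{UN}(R_x, R_y, \gamma) = \inf_{\tilde{\mathsf{x}}, \tilde{\mathsf{y}}, \bar{\mathsf{x}}, \bar{\mathsf{y}}} \gamma D(p_{\tilde{\mathsf{x}}\tilde{\mathsf{y}}} \| p_{\mathsf{xy}}) + (1 - \gamma) D(p_{\bar{\mathsf{x}}\bar{\mathsf{y}}} \| p_{\mathsf{xy}}) + \left| \gamma [R_y - H(\tilde{\mathsf{y}}|\tilde{\mathsf{x}})] + (1 - \gamma) [R_x + R_y - H(\bar{\mathsf{x}}, \bar{\mathsf{y}})] \right|^+, \tag{18}
\]

where the random variables $(\tilde{\mathsf{x}}, \tilde{\mathsf{y}})$ and $(\bar{\mathsf{x}}, \bar{\mathsf{y}})$ have joint distributions $p_{\tilde{\mathsf{x}}\tilde{\mathsf{y}}}$ and $p_{\bar{\mathsf{x}}\bar{\mathsf{y}}}$, respectively. The function
$|z|^+ = z$ if $z \geq 0$ and $|z|^+ = 0$ if $z < 0$.

Remark: Definitions (i) and (ii) in Theorems 6 and 7 concern individual decoding error events, which might be
useful in applications where the $\mathsf{x}$ and $\mathsf{y}$ streams are decoded jointly, but utilized individually. The more standard
joint error event is given by (iii).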

Remark: We can compare the joint error event for block and streaming Slepian-Wolf coding, cf. (17) with (2).
The streaming exponent differs by the extra parameter $\gamma$ that must be minimized over. If the minimizing $\gamma = 1$, then
the block and streaming exponents are the same. The minimization over $\gamma$ results from a fundamental difference
in the types of error-causing events that can occur in streaming Slepian-Wolf as compared to block Slepian-Wolf.
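
The role of $\gamma$ can be explored numerically by evaluating (13) and (14) on a grid. The sketch below assumes the binary symmetric example of the Introduction, base-2 logarithms, and coarse grid searches over $\rho$ and $\gamma$, so it only approximates the suprema and infima; the function names, grid sizes and the rate pair $R_x = R_y = 0.9$ are illustrative choices, not taken from the paper.

```python
import math

def bsc_joint(p: float) -> dict:
    """Joint pmf of (x, y): x uniform on {0, 1}, y the output of a BSC(p) with input x."""
    return {(x, y): 0.5 * (p if x != y else 1 - p) for x in (0, 1) for y in (0, 1)}

def e_xy(pxy: dict, rx: float, ry: float, rho: float) -> float:
    """E_xy(R_x, R_y, rho) from (14)."""
    s = sum(v ** (1 / (1 + rho)) for v in pxy.values())
    return rho * (rx + ry) - (1 + rho) * math.log2(s)

def e_cond(pxy: dict, rate: float, rho: float, decode: str = "x") -> float:
    """E_{x|y}(rate, rho) if decode == 'x', otherwise E_{y|x}(rate, rho), from (14)."""
    inner = {}
    for (x, y), v in pxy.items():
        key = y if decode == "x" else x  # outer sum runs over the conditioning variable
        inner[key] = inner.get(key, 0.0) + v ** (1 / (1 + rho))
    return rho * rate - math.log2(sum(t ** (1 + rho) for t in inner.values()))

def e_gamma(pxy: dict, rx: float, ry: float, gamma: float, decode: str = "x", steps: int = 200) -> float:
    """Grid approximation of E_x^{ML} (decode='x') or E_y^{ML} (decode='y') from (13)."""
    rate = rx if decode == "x" else ry
    return max(
        gamma * e_cond(pxy, rate, k / steps, decode) + (1 - gamma) * e_xy(pxy, rx, ry, k / steps)
        for k in range(steps + 1)
    )

def e_joint(pxy: dict, rx: float, ry: float, steps: int = 100) -> float:
    """Grid approximation of the joint exponent in Theorem 6 (iii)."""
    gammas = [k / steps for k in range(steps + 1)]
    return min(e_gamma(pxy, rx, ry, g, d) for g in gammas for d in ("x", "y"))

pxy = bsc_joint(0.1)
print(e_joint(pxy, 0.9, 0.9))          # joint streaming exponent on the grid
print(e_gamma(pxy, 0.9, 0.9, 1.0))     # gamma = 1 term for comparison
```

Comparing the minimum over $\gamma$ with the values at $\gamma = 0$ and $\gamma = 1$ indicates whether an interior $\gamma$ dominates for a given rate pair.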

Remark: The error exponents of maximum likelihood and universal decoding in Theorems 6 and 7 are the same.
However, because there are new classes of error events possible in streaming, this needs proof. The equivalence is
summarized in the following theorem.

Theorem 8: Let $(R_x, R_y)$ be a rate pair such that $R_x > H(\mathsf{x}|\mathsf{y})$, $R_y > H(\mathsf{y}|\mathsf{x})$, and $R_x + R_y > H(\mathsf{x},\mathsf{y})$. Then,

\[
E_{ML,SW,x}(R_x, R_y) = E_{UN,SW,x}(R_x, R_y), \tag{19}
\]

and

\[
E_{ML,SW,y}(R_x, R_y) = E_{UN,SW,y}(R_x, R_y). \tag{20}
\]

Theorem 8 follows directly from the following lemma, shown in the appendix.

Lemma 1: For all $\gamma \in [0, 1]$,

\[
E_x^{ML}(R_x, R_y, \gamma) = E_x^{UN}(R_x, R_y, \gamma), \tag{21}
\]

and

\[
E_y^{ML}(R_x, R_y, \gamma) = E_y^{UN}(R_x, R_y, \gamma). \tag{22}
\]

Remark: This theorem allows us to simplify notation. For example, we can define $E_x(R_x, R_y, \gamma)$ as
$E_x(R_x, R_y, \gamma) = E_x^{ML}(R_x, R_y, \gamma) = E_x^{UN}(R_x, R_y, \gamma)$, and can similarly define $E_y(R_x, R_y, \gamma)$. Further, since the ML and
universal exponents are the same for the whole rate region, we can define $E_{SW,x}(R_x, R_y)$ as
$E_{SW,x}(R_x, R_y) = E_{ML,SW,x}(R_x, R_y) = E_{UN,SW,x}(R_x, R_y)$, and can similarly define $E_{SW,y}(R_x, R_y)$.



IV. Numerical Results 

To build insight into the differences between the sequential error exponents of Theorems 2-8 and block-coding error exponents, we give some examples of the exponents for binary sources.

For the point-to-point case, the error exponents of random sequential and block source coding are identical everywhere in the achievable rate region, as can be seen by comparing Theorem 3 and Corollary 2. The same is true for source coding with decoder side information (cf. Theorem 5 and the corresponding corollary). For distributed Slepian-Wolf source coding, however, the sequential and block error exponents can be different. The reason for the discrepancy is that a new type of error event can be dominant in Slepian-Wolf source coding. This is reflected in Theorem 6 by the minimization over $\gamma$. Example 2 illustrates the impact of this $\gamma$ term.

For Slepian-Wolf source coding at very high rates, where $R_x > H(x)$, the decoder can ignore any information from encoder y and still decode x with a positive error exponent. However, the decoder could also choose to decode sources x and y jointly. Fig. 6a and 6b illustrate that joint decoding may or, surprisingly, may not help in decoding source x. This is seen by comparing the error exponent when the decoder ignores the side information from encoder y (the dotted curves) to the joint error exponent (the lower solid curves). It seems that when the rate for source y is low, atypical behaviors of source y can cause joint decoding errors that end up corrupting the x estimates. This holds for both block and sequential coding.
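The comparison between block and sequential exponents described above can be explored numerically. The following is a minimal sketch, not the authors' code: it evaluates the ML-form exponents of Theorem 6 by plain grid search over $\rho$ and $\gamma$ (natural logarithms, rates in nats). The pmf `P`, the grid sizes and all function names are assumptions made for illustration.

```python
import math

# Illustrative asymmetric binary joint pmf p_xy (an assumption for this sketch).
P = {(0, 0): 0.1, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.8}
RHOS = [i / 200 for i in range(201)]     # grid over rho in [0, 1]
GAMMAS = [i / 100 for i in range(101)]   # grid over gamma in [0, 1]

def e_cond(rate, rho, swap=False):
    """E_{x|y}(rate, rho); with swap=True this is E_{y|x}(rate, rho)."""
    s = 1.0 / (1.0 + rho)
    total = 0.0
    for b in (0, 1):  # conditioning symbol
        inner = sum(P[(b, a) if swap else (a, b)] ** s for a in (0, 1))
        total += inner ** (1.0 + rho)
    return rho * rate - math.log(total)

def e_joint(r_x, r_y, rho):
    """E_{xy}(R_x, R_y, rho)."""
    s = 1.0 / (1.0 + rho)
    return rho * (r_x + r_y) - (1.0 + rho) * math.log(sum(v ** s for v in P.values()))

def e_x(r_x, r_y, gamma):
    """E_x(R_x, R_y, gamma): sup over rho, approximated on the grid."""
    return max(gamma * e_cond(r_x, rho) + (1 - gamma) * e_joint(r_x, r_y, rho)
               for rho in RHOS)

def e_y(r_x, r_y, gamma):
    """E_y(R_x, R_y, gamma): sup over rho, approximated on the grid."""
    return max(gamma * e_cond(r_y, rho, swap=True) + (1 - gamma) * e_joint(r_x, r_y, rho)
               for rho in RHOS)

def block_exponent_x(r_x, r_y):
    """Block exponent: only the endpoints gamma = 0 and gamma = 1 matter."""
    return min(e_x(r_x, r_y, 0.0), e_x(r_x, r_y, 1.0))

def streaming_exponent_x(r_x, r_y):
    """Streaming exponent for source x: infimum over all gamma on the grid."""
    first = min(e_x(r_x, r_y, g) for g in GAMMAS)
    second = min(e_y(r_x, r_y, g) / (1 - g) for g in GAMMAS if g < 1.0)
    return min(first, second)
```

Because the grid over $\gamma$ contains both endpoints, the streaming value computed this way can never exceed the block value; how large the gap is depends on the source and the rate pair.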



A. Example 1: symmetric source with uniform marginals 

Consider a symmetric source where $|\mathcal{X}| = |\mathcal{Y}| = 2$, $p_{xy}(0,0) = 0.45$, $p_{xy}(0,1) = p_{xy}(1,0) = 0.05$ and $p_{xy}(1,1) = 0.45$. This is a marginally-uniform source: x is Bernoulli(1/2), y is the output from a BSC with input x, thus y is Bernoulli(1/2) as well. For this source $H(x) = H(y) = \log(2)$, $H(x|y) = H(y|x) = 0.32$, $H(x,y) = 1.02$. The achievable rate region is the triangle shown in Figure 3.

[Figure 3: rate region in the $(R_x, R_y)$ plane, marking $R_y = 0.49$, $R_y = 0.67$ and the line $R_x + R_y = H(x,y)$.]

Fig. 3. Rate region for the Example 1 source. We focus on the error exponent on source x for fixed encoder y rates: $R_y = 0.49$ and $R_y = 0.67$.
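The entropies quoted for the Example 1 source can be checked with a few lines of code. This is a small sketch (the helper name `entropies` is ours) that works in nats; the computed values are roughly 0.693, 0.325 and 1.018, which agree with the quoted figures up to rounding or truncation.

```python
import math

def entropies(p_xy):
    """Return (H(x), H(y), H(x|y), H(y|x), H(x,y)) in nats for a pmf {(x, y): prob}."""
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p

    def h(dist):
        return -sum(p * math.log(p) for p in dist.values() if p > 0)

    h_x, h_y, h_xy = h(p_x), h(p_y), h(p_xy)
    # Chain rule: H(x|y) = H(x,y) - H(y) and H(y|x) = H(x,y) - H(x).
    return h_x, h_y, h_xy - h_y, h_xy - h_x, h_xy

example1 = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
h_x, h_y, h_x_given_y, h_y_given_x, h_xy = entropies(example1)
```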



For this source, as will be shown later, the dominant sequential error event is on the diagonal line in Fig. 9. This is to say that:

$$E_{SW,x}(R_x, R_y) = E^{BLOCK}_{SW,x}(R_x, R_y) = E_x^{ML}(R_x, R_y, 0) = \sup_{\rho \in [0,1]} [E_{xy}(R_x, R_y, \rho)], \tag{23}$$

where $E^{BLOCK}_{SW,x}(R_x, R_y) = \min\{E_x^{ML}(R_x, R_y, 0), E_x^{ML}(R_x, R_y, 1)\}$ as shown in [9].

Similarly for source y:

$$E_{SW,y}(R_x, R_y) = E^{BLOCK}_{SW,y}(R_x, R_y) = E_y^{ML}(R_x, R_y, 0) = \sup_{\rho \in [0,1]} [E_{xy}(R_x, R_y, \rho)]. \tag{24}$$

We first show that for this source, $\forall \rho \ge 0$, $E_{x|y}(R_x, \rho) \ge E_{xy}(R_x, R_y, \rho)$. By definition:

\begin{align}
E_{x|y}(R_x, \rho) - E_{xy}(R_x, R_y, \rho)
&= \rho R_x - \log\Big[\sum_y \Big[\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big] - \Big(\rho(R_x + R_y) - \log\Big[\sum_{x,y} p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big) \notag\\
&= -\rho R_y - \log\Big[2\Big[\sum_x p_{xy}(x,0)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big] + \log\Big[2\sum_x p_{xy}(x,0)^{\frac{1}{1+\rho}}\Big]^{1+\rho} \notag\\
&= -\rho R_y - \log[2] + \log[2^{1+\rho}] \notag\\
&= \rho(\log[2] - R_y) \notag\\
&\ge 0. \notag
\end{align}

The last inequality is true because we only consider the problem when $R_y \le \log|\mathcal{Y}|$. Otherwise, y is better viewed as perfectly known side-information. Now

\begin{align}
E_x^{ML}(R_x, R_y, \gamma) &= \sup_{\rho \in [0,1]} [\gamma E_{x|y}(R_x, \rho) + (1-\gamma) E_{xy}(R_x, R_y, \rho)] \notag\\
&\ge \sup_{\rho \in [0,1]} [E_{xy}(R_x, R_y, \rho)] \notag\\
&= E_x^{ML}(R_x, R_y, 0). \notag
\end{align}

Similarly $E_y^{ML}(R_x, R_y, \gamma) \ge E_y^{ML}(R_x, R_y, 0) = E_x^{ML}(R_x, R_y, 0)$. Finally,

\begin{align}
E_{SW,x}(R_x, R_y) &= \min\Big\{ \inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma),\ \inf_{\gamma \in [0,1]} \frac{1}{1-\gamma} E_y(R_x, R_y, \gamma) \Big\} \notag\\
&= E_x^{ML}(R_x, R_y, 0). \notag
\end{align}

In particular $E_x(R_x, R_y, 1) \ge E_x(R_x, R_y, 0)$, so

\begin{align}
E^{BLOCK}_{SW,x}(R_x, R_y) &= \min\{E_x^{ML}(R_x, R_y, 0), E_x^{ML}(R_x, R_y, 1)\} \notag\\
&= E_x^{ML}(R_x, R_y, 0). \notag
\end{align}

The same proof holds for source y.
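The identity $E_{x|y}(R_x, \rho) - E_{xy}(R_x, R_y, \rho) = \rho(\log 2 - R_y)$ used in this argument is easy to sanity-check numerically for the Example 1 source. The sketch below is illustrative only; the function names are ours and natural logarithms are assumed.

```python
import math

# Example 1 source: symmetric with uniform marginals.
P = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

def e_x_given_y(r_x, rho, p=P):
    """E_{x|y}(R_x, rho) = rho*R_x - log sum_y [sum_x p(x,y)^(1/(1+rho))]^(1+rho)."""
    s = 1.0 / (1.0 + rho)
    total = 0.0
    for y in (0, 1):
        inner = sum(p[(x, y)] ** s for x in (0, 1))
        total += inner ** (1.0 + rho)
    return rho * r_x - math.log(total)

def e_xy(r_x, r_y, rho, p=P):
    """E_{xy}(R_x, R_y, rho) = rho*(R_x+R_y) - (1+rho) log sum_{x,y} p(x,y)^(1/(1+rho))."""
    s = 1.0 / (1.0 + rho)
    return rho * (r_x + r_y) - (1.0 + rho) * math.log(sum(v ** s for v in p.values()))

def gap(r_x, r_y, rho):
    """Difference E_{x|y} - E_{xy}; for this source it should equal rho*(log 2 - R_y)."""
    return e_x_given_y(r_x, rho) - e_xy(r_x, r_y, rho)
```

Note that the gap does not depend on $R_x$, which is why the comparison reduces to the sign of $\log 2 - R_y$.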

In Fig. 4 we plot the joint sequential/block coding error exponents $E_{SW,x}(R_x, R_y) = E^{BLOCK}_{SW,x}(R_x, R_y)$. The error exponents are positive iff $R_x > H(x,y) - R_y = 1.02 - R_y$.

[Figure 4: error exponent versus rate of encoder x, with one curve for each of $R_y = 0.49$ and $R_y = 0.67$.]

Fig. 4. Error exponents plot: $E_{SW,x}(R_x, R_y)$ plotted for $R_y = 0.49$ and $R_y = 0.67$. Here $E_{SW,x}(R_x, R_y) = E^{BLOCK}_{SW,x}(R_x, R_y) = E_{SW,y}(R_x, R_y) = E^{BLOCK}_{SW,y}(R_x, R_y)$ and $E_x(R_x) = 0$.



B. Example 2: non-symmetric source 

Consider a non-symmetric source where $|\mathcal{X}| = |\mathcal{Y}| = 2$, $p_{xy}(0,0) = 0.1$, $p_{xy}(0,1) = p_{xy}(1,0) = 0.05$ and $p_{xy}(1,1) = 0.8$. For this source $H(x) = H(y) = 0.42$, $H(x|y) = H(y|x) = 0.29$ and $H(x,y) = 0.71$. The achievable rate region is shown in Fig. 5. In Fig. 6a, 6b, 6c and 6d, we compare the joint sequential error exponent $E_{SW,x}(R_x, R_y)$, the joint block coding error exponent $E^{BLOCK}_{SW,x}(R_x, R_y) = \min\{E_x(R_x, R_y, 0), E_x(R_x, R_y, 1)\}$ as shown in [9], and the individual error exponent for source x, $E_x(R_x)$, as shown in Corollary 2. Notice that $E_x(R_x) > 0$ only if $R_x > H(x)$. In Fig. 7 we compare the sequential error exponent for source y, $E_{SW,y}(R_x, R_y)$, the block coding error exponent for source y, $E^{BLOCK}_{SW,y}(R_x, R_y) = \min\{E_y(R_x, R_y, 0), E_y(R_x, R_y, 1)\}$, and $E_y(R_y)$, which is a constant since we fix $R_y$.

For $R_y = 0.35$, as shown in Fig. 6a,b and 7a,b, the difference between the block coding and sequential coding error exponents is very small for both source x and y. More interestingly, as shown in Fig. 6a, because the rate of source y is low, a decoding error is more likely to be caused by atypical behavior of source y. So as $R_x$ increases, it is sometimes better to ignore source y and decode x individually. This is evident as the dotted curve is above the solid curves.

For $R_y = 0.49$, as shown in Fig. 6c,d and 7c,d, since the rate for source y is high enough, source y can be decoded with a positive error exponent individually, as shown in Fig. 7c. But as the rate of source x increases, joint decoding gives a better error exponent. When $R_x$ is very high, we observe the saturation of the error exponent on y, as if source x were known perfectly to the decoder! This is illustrated by the flat part of the solid curves in Fig. 7c.

[Figure 5: rate region in the $(R_x, R_y)$ plane, marking $R_y = 0.35$, $R_y = 0.49$, the line $R_x + R_y = H(x,y)$ and the achievable region.]

Fig. 5. Rate region for the Example 2 source. We focus on the error exponent on source x for fixed encoder y rates: $R_y = 0.35$ and $R_y = 0.49$.

V. Streaming point-to-point coding via sequential random binning

In this section we prove Theorems 2 and 3. While the emphasis of the paper is on distributed source coding, the basic causal random binning ideas and analysis techniques can be more easily developed in the point-to-point context.

A. Maximum-likelihood decoding

To show Theorems 2 and 3 we first develop the common core of the proof in the context of ML decoding. The proof strategy is as follows. A decoding error can only occur if there is some spurious source sequence $\tilde{x}^n$ that satisfies three conditions: (i) it must be in the same bin (share the same parities) as $x^n$, i.e., $\tilde{x}^n \in \mathcal{B}_x(x^n)$, (ii) it must be more likely than the true sequence, i.e., $p_x(\tilde{x}^n) > p_x(x^n)$, and (iii) $\tilde{x}_l \ne x_l$ for some $l \le n - \Delta$.

The error probability is

\begin{align}
\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}] &= \sum_{x^n} \Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta} \mid \mathsf{x}^n = x^n]\, p_x(x^n) \tag{25}\\
&\le \sum_{x^n} \sum_{l=1}^{n-\Delta} \Pr[\exists\, \tilde{x}^n \in \mathcal{B}_x(x^n) \cap \mathcal{F}_n(l, x^n) \text{ s.t. } p_x(\tilde{x}^n) > p_x(x^n)]\, p_x(x^n) \tag{26}\\
&= \sum_{l=1}^{n-\Delta} \Big\{ \sum_{x^n} \Pr[\exists\, \tilde{x}^n \in \mathcal{B}_x(x^n) \cap \mathcal{F}_n(l, x^n) \text{ s.t. } p_x(\tilde{x}^n) > p_x(x^n)]\, p_x(x^n) \Big\} \notag\\
&= \sum_{l=1}^{n-\Delta} p_n(l). \tag{27}
\end{align}

Fig. 6. Error exponents plot for source x for fixed $R_y$ as $R_x$ varies:
$R_y = 0.35$:

(a) Solid curve: $E_{SW,x}(R_x, R_y)$; dashed curve: $E^{BLOCK}_{SW,x}(R_x, R_y)$; dotted curve: $E_x(R_x)$. Notice that $E_{SW,x}(R_x, R_y) \le E^{BLOCK}_{SW,x}(R_x, R_y)$, but the difference is small.

(b) $10 \log_{10}\big(E^{BLOCK}_{SW,x}(R_x, R_y) / E_{SW,x}(R_x, R_y)\big)$. This shows the difference is there at high rates.

$R_y = 0.49$:

(c) Solid curve: $E_{SW,x}(R_x, R_y)$; dashed curve: $E^{BLOCK}_{SW,x}(R_x, R_y)$; dotted curve: $E_x(R_x)$. Again $E_{SW,x}(R_x, R_y) \le E^{BLOCK}_{SW,x}(R_x, R_y)$, but the difference is extremely small.

(d) $10 \log_{10}\big(E^{BLOCK}_{SW,x}(R_x, R_y) / E_{SW,x}(R_x, R_y)\big)$. This shows the difference is there at intermediate low rates.

After conditioning on the realized source sequence in (25), the remaining randomness is only in the binning. In (26) we decompose the error event into a number of mutually exclusive events (see Fig. 8) by partitioning all source sequences $\tilde{x}^n$ into sets $\mathcal{F}_n(l, x^n)$ defined by the time $l$ of the first sample in which they differ from the realized source $x^n$,

$$\mathcal{F}_n(l, x^n) = \{\tilde{x}^n \in \mathcal{X}^n \mid \tilde{x}^{l-1} = x^{l-1},\ \tilde{x}_l \ne x_l\}, \tag{28}$$

and define $\mathcal{F}_n(n+1, x^n) = \{x^n\}$. Finally, in (27) we define

$$p_n(l) = \sum_{x^n} \Pr[\exists\, \tilde{x}^n \in \mathcal{B}_x(x^n) \cap \mathcal{F}_n(l, x^n) \text{ s.t. } p_x(\tilde{x}^n) > p_x(x^n)]\, p_x(x^n). \tag{29}$$

We now upper bound $p_n(l)$ using a Chernoff bound argument similar to [9].

Lemma 2: $p_n(l) \le \exp\{-(n - l + 1) E_{ML}(R_x)\}$.

[Figure 7: four panels (a)-(d); error exponents for source y versus rate of encoder x.]

Fig. 7. Error exponents plot for source y for fixed $R_y$ as $R_x$ varies:

$R_y = 0.35$:

(a) Solid curve: $E_{SW,y}(R_x, R_y)$; dashed curve: $E^{BLOCK}_{SW,y}(R_x, R_y)$. Here $E_{SW,y}(R_x, R_y) \le E^{BLOCK}_{SW,y}(R_x, R_y)$ and the difference is extremely small. $E_y(R_y)$ is 0 because $R_y = 0.35 < H(y)$.

(b) $10 \log_{10}\big(E^{BLOCK}_{SW,y}(R_x, R_y) / E_{SW,y}(R_x, R_y)\big)$. This shows the two exponents are not identical everywhere.

$R_y = 0.49$:

(c) Solid curve: $E_{SW,y}(R_x, R_y)$; dashed curve: $E^{BLOCK}_{SW,y}(R_x, R_y)$, with $E_{SW,y}(R_x, R_y) \le E^{BLOCK}_{SW,y}(R_x, R_y)$. $E_y(R_y)$ is constant, shown as a dotted line.

(d) $10 \log_{10}\big(E^{BLOCK}_{SW,y}(R_x, R_y) / E_{SW,y}(R_x, R_y)\big)$. Notice how the gap goes to infinity when we leave the Slepian-Wolf region.

[Figure 8: timeline of source symbols indexed $1, \ldots, n - \Delta, \ldots, n$.]

Fig. 8. Decoding error probability at $n - \Delta$ can be union bounded by the sum of probabilities of first decoding error at $l$, $1 \le l \le n - \Delta$. The dominant error event $p_n(n - \Delta)$ is the one in the highlighted oval (shortest delay).

Proof:



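\begin{align}
p_n(l) &= \sum_{x^n} \Pr[\exists\, \tilde{x}^n \in \mathcal{B}_x(x^n) \cap \mathcal{F}_n(l, x^n) \text{ s.t. } p_x(\tilde{x}^n) > p_x(x^n)]\, p_x(x^n) \notag\\
&\le \sum_{x^n} \min\Big[1, \sum_{\substack{\tilde{x}^n \in \mathcal{F}_n(l, x^n) \text{ s.t.}\\ p_x(x^n) < p_x(\tilde{x}^n)}} \Pr[\tilde{x}^n \in \mathcal{B}_x(x^n)]\Big]\, p_x(x^n) \tag{30}\\
&= \sum_{x^{l-1}, x_l^n} \min\Big[1, \sum_{\substack{\tilde{x}_l^n \text{ s.t.}\\ p_x(x_l^n) < p_x(\tilde{x}_l^n)}} \exp\{-(n-l+1)R_x\}\Big]\, p_x(x^{l-1})\, p_x(x_l^n) \tag{31}\\
&= \sum_{x_l^n} \min\Big[1, \sum_{\substack{\tilde{x}_l^n \text{ s.t.}\\ p_x(x_l^n) < p_x(\tilde{x}_l^n)}} \exp\{-(n-l+1)R_x\}\Big]\, p_x(x_l^n) \notag\\
&= \sum_{x_l^n} \min\Big[1, \sum_{\tilde{x}_l^n} 1[p_x(\tilde{x}_l^n) > p_x(x_l^n)] \exp\{-(n-l+1)R_x\}\Big]\, p_x(x_l^n) \tag{32}\\
&\le \sum_{x_l^n} \min\Big[1, \sum_{\tilde{x}_l^n} \min\Big[1, \frac{p_x(\tilde{x}_l^n)}{p_x(x_l^n)}\Big] \exp\{-(n-l+1)R_x\}\Big]\, p_x(x_l^n) \notag\\
&\le \sum_{x_l^n} \Big[\sum_{\tilde{x}_l^n} \Big[\frac{p_x(\tilde{x}_l^n)}{p_x(x_l^n)}\Big]^{\frac{1}{1+\rho}} \exp\{-(n-l+1)R_x\}\Big]^{\rho}\, p_x(x_l^n) \tag{33}\\
&= \sum_{x_l^n} p_x(x_l^n)^{\frac{1}{1+\rho}} \Big[\sum_{\tilde{x}_l^n} p_x(\tilde{x}_l^n)^{\frac{1}{1+\rho}}\Big]^{\rho} \exp\{-(n-l+1)\rho R_x\} \notag\\
&= \Big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\Big]^{(n-l+1)} \Big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\Big]^{(n-l+1)\rho} \exp\{-(n-l+1)\rho R_x\} \tag{34}\\
&= \Big[\sum_x p_x(x)^{\frac{1}{1+\rho}}\Big]^{(n-l+1)(1+\rho)} \exp\{-(n-l+1)\rho R_x\} \notag\\
&= \exp\Big\{-(n-l+1)\Big[\rho R_x - (1+\rho)\ln\Big(\sum_x p_x(x)^{\frac{1}{1+\rho}}\Big)\Big]\Big\}. \tag{35}
\end{align}

In (30) the union bound is applied. In (31) we use the fact that after the first symbol in which two sequences differ, the remaining parity bits are independent, and the fact that only the likelihood of the differing suffixes matters. That is, if $x^{l-1} = \tilde{x}^{l-1}$, then $p_x(x^n) < p_x(\tilde{x}^n)$ if and only if $p_x(x_l^n) < p_x(\tilde{x}_l^n)$. In (32) $1(\cdot)$ is the indicator function, taking the value one if the argument is true, and zero if it is false. We get (33) by limiting $\rho$ to the range $0 \le \rho \le 1$, since the arguments of the minimization are both positive and upper-bounded by one. We use the iid property of the source, exchanging sums and products, to get (34). The bound in (35) is true for all $\rho$ in the range $0 \le \rho \le 1$. Maximizing (35) over $\rho$ gives $p_n(l) \le \exp\{-(n-l+1)E_{ML}(R_x)\}$, where $E_{ML}(R_x)$ is defined in Theorem 2, in particular (9). ■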
Using Lemma 2 in (27) gives

\begin{align}
\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}] &\le \sum_{l=1}^{n-\Delta} \exp\{-(n-l+1)E_{ML}(R_x)\} \tag{36}\\
&= \sum_{l=1}^{n-\Delta} \exp\{-(n-l+1-\Delta)E_{ML}(R_x)\} \exp\{-\Delta E_{ML}(R_x)\} \notag\\
&\le K_0 \exp\{-\Delta E_{ML}(R_x)\}. \tag{37}
\end{align}

In (37) we pull out the exponent in $\Delta$. The remaining summation is a sum over decaying exponentials, and thus can be bounded by some constant $K_0$. This proves Theorem 2.
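As a worked check of this last step, one explicit (not necessarily tightest) choice of the constant can be written out. The closed form for $K_0$ below is an illustration that assumes $E_{ML}(R_x) > 0$; the argument above only needs that some constant exists. Substituting $j = n - l + 1 - \Delta$,

$$\sum_{l=1}^{n-\Delta} e^{-(n-l+1-\Delta)E_{ML}(R_x)} = \sum_{j=1}^{n-\Delta} e^{-jE_{ML}(R_x)} \le \sum_{j=1}^{\infty} e^{-jE_{ML}(R_x)} = \frac{1}{e^{E_{ML}(R_x)} - 1},$$

so (37) holds with $K_0 = 1/(e^{E_{ML}(R_x)} - 1)$, a value that depends on neither $n$ nor $\Delta$.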
B. Error events and sequential decoding

To better understand the dominant error event in the sum (36), consider constructing the ML estimate in a symbol-by-symbol sequential manner. The decoder starts by first identifying as candidates those sequences whose parities match the received bit stream up to time $n$. If the encoder observes the length-$n$ sequence $\mathsf{x}^n = x^n$, this is $\{\tilde{x}^n \text{ s.t. } \tilde{x}^n \in \mathcal{B}_x(x^n)\}$. The $l$-th symbol of the estimate, $\hat{x}_l$, is defined as

$$\hat{x}_l = w_l \quad \text{where} \quad w = \arg\max_{\tilde{x}^n \in \mathcal{B}_x(x^n) \text{ s.t. } \tilde{x}^{l-1} = \hat{x}^{l-1}} p_x(\tilde{x}_l^n). \tag{38}$$

The estimate thus produced is the maximum likelihood estimate because the decision regarding which of a pair of sequences is more likely depends only on which one's suffix is more likely.
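The causal binning and the symbol-by-symbol decoder of (38) can be sketched on a toy scale. The code below is an illustration under stated assumptions, not the construction analyzed in the paper: random parities are imitated with a SHA-256 hash of the prefix, the rate is fixed to one bit per source symbol, the source pmf `P_X` is hypothetical, and decoding is brute force over all $2^n$ sequences.

```python
import hashlib
import math
from itertools import product

P_X = {0: 0.8, 1: 0.2}   # hypothetical iid binary source
BITS_PER_SYMBOL = 1       # hypothetical integer rate, chosen for simplicity

def parities(seq, seed=0):
    """Causal bin labels: the bits emitted at time t depend only on the prefix seq[:t]."""
    stream = []
    for t in range(1, len(seq) + 1):
        for j in range(BITS_PER_SYMBOL):
            key = repr((seed, t, j, tuple(seq[:t]))).encode()
            stream.append(hashlib.sha256(key).digest()[0] & 1)
    return stream

def suffix_likelihood(seq, start):
    """Probability of the suffix seq[start:] under the iid source."""
    return math.prod(P_X[s] for s in seq[start:])

def sequential_ml_decode(received, n):
    """Symbol-by-symbol estimate in the style of (38), by brute force over the bin."""
    candidates = [c for c in product(P_X, repeat=n) if parities(c) == received]
    estimate = []
    for l in range(n):
        # Keep candidates that agree with the symbols already decided.
        matching = [c for c in candidates if list(c[:l]) == estimate]
        w = max(matching, key=lambda c: suffix_likelihood(c, l))
        estimate.append(w[l])
    return tuple(estimate), candidates
```

Two sequences that share a prefix share all parities emitted during that prefix, which is the property the error analysis relies on. The returned estimate always lies in the received bin and attains the maximum likelihood over that bin, though with a low rate it need not equal the true sequence.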

This is a decision-directed decoder. Semi-hard¹³ estimates are made sequentially for each symbol. These estimates are then fixed, and taken as true when estimating subsequent symbols. Each such hard decision is analogous to a classic block-coding Slepian-Wolf problem. This is because we only need to decide between sequences that start to differ in the symbol we are trying to estimate: previous symbols have been fixed, and subsequent symbols are not yet in question. Thus, all sequences that could lead to different estimates of symbol $l$ are binned independently for the remainder of the block. This is why the error exponent we derive in (37) equals Gallager's block coding exponent [9]. Since the error exponent for each block-decoding problem is the same, the dominant error event is the hard decision with the shortest block-length. This symbol is the last symbol we need to estimate. Its block-length equals the estimation delay $\Delta$. We revisit this story in Section VII when we consider Slepian-Wolf coding. In that context the dominant error event has some features that do not arise in block coding.

C. Universal decoding 

In this section we prove Theorem 3. We use the sequential decoder introduced in Section V-B but with minimum-entropy, rather than maximum-likelihood, decoding. That is,

$$\hat{x}_l = w[l]_l \quad \text{where} \quad w[l] = \arg\min_{\tilde{x}^n \in \mathcal{B}_x(x^n) \text{ s.t. } \tilde{x}^{l-1} = \hat{x}^{l-1}} H(\tilde{x}_l^n). \tag{39}$$

We term this a minimum suffix-entropy decoder. The reason for using this decoder instead of the standard minimum block-entropy decoder is that the block-entropy decoder has a polynomial term in $n$ (resulting from summing over the type classes) that multiplies the exponential decay in $\Delta$. For $n$ large, this polynomial can dominate. Using the minimum suffix-entropy decoder results in a polynomial term in $\Delta$.

With this decoder, errors can only occur if there is some sequence $\tilde{x}^n$ such that (i) $\tilde{x}^n \in \mathcal{B}_x(x^n)$, (ii) $\tilde{x}^{l-1} = x^{l-1}$ and $\tilde{x}_l \ne x_l$ for some $l \le n - \Delta$, and (iii) the empirical suffix entropy of $\tilde{x}_l^n$ is such that $H(\tilde{x}_l^n) < H(x_l^n)$. Building on the common core of the achievability (25)-(27), with the substitution of universal decoding in the place of maximum likelihood, results in the following definition of $p_n(l)$ (cf. (40) with (29)).

$$p_n(l) = \sum_{x^n} \Pr[\exists\, \tilde{x}^n \in \mathcal{B}_x(x^n) \cap \mathcal{F}_n(l, x^n) \text{ s.t. } H(\tilde{x}_l^n) < H(x_l^n)]\, p_x(x^n) \tag{40}$$

The following lemma gives a bound on $p_n(l)$.

¹³ Decisions are only "hard" for computational time. As soon as the next set of parities arrive and real-time advances, all the computations are done again.

Lemma 3: For minimum suffix-entropy decoding, $p_n(l) \le (n - l + 2)^{2|\mathcal{X}|} \exp\{-(n - l + 1)E_{UN}(R_x)\}$.

Proof: We define $P^{n-l}$ to be the type of the length-$(n - l + 1)$ sequence $x_l^n$, and $T_{P^{n-l}}$ to be the corresponding type class so that $x_l^n \in T_{P^{n-l}}$. Analogous definitions hold for $\tilde{P}^{n-l}$ and $\tilde{x}_l^n$. We rewrite the constraint $H(\tilde{x}_l^n) < H(x_l^n)$ as $H(\tilde{P}^{n-l}) < H(P^{n-l})$. Thus,

\begin{align}
p_n(l) &= \sum_{x^n} \Pr[\exists\, \tilde{x}^n \in \mathcal{B}_x(x^n) \cap \mathcal{F}_n(l, x^n) \text{ s.t. } H(\tilde{x}_l^n) < H(x_l^n)]\, p_x(x^n) \notag\\
&\le \sum_{x^{l-1}, x_l^n} \min\Big[1, \sum_{\substack{\tilde{x}_l^n \text{ s.t.}\\ H(\tilde{x}_l^n) < H(x_l^n)}} \exp\{-(n-l+1)R_x\}\Big]\, p_x(x^{l-1})\, p_x(x_l^n) \notag\\
&= \sum_{x_l^n} \min\Big[1, \sum_{\substack{\tilde{x}_l^n \text{ s.t.}\\ H(\tilde{x}_l^n) < H(x_l^n)}} \exp\{-(n-l+1)R_x\}\Big]\, p_x(x_l^n) \tag{41}\\
&= \sum_{P^{n-l}} \sum_{x_l^n \in T_{P^{n-l}}} \min\Big[1, \sum_{\substack{\tilde{P}^{n-l} \text{ s.t.}\\ H(\tilde{P}^{n-l}) < H(P^{n-l})}} \sum_{\tilde{x}_l^n \in T_{\tilde{P}^{n-l}}} \exp\{-(n-l+1)R_x\}\Big]\, p_x(x_l^n) \tag{42}\\
&\le \sum_{P^{n-l}} \sum_{x_l^n \in T_{P^{n-l}}} \min\Big[1, (n-l+2)^{|\mathcal{X}|} \exp\{-(n-l+1)[R_x - H(P^{n-l})]\}\Big]\, p_x(x_l^n) \tag{43}\\
&\le (n-l+2)^{|\mathcal{X}|} \sum_{P^{n-l}} \sum_{x_l^n \in T_{P^{n-l}}} \exp\{-(n-l+1)|R_x - H(P^{n-l})|^+\} \exp\{-(n-l+1)[D(P^{n-l}\|p_x) + H(P^{n-l})]\} \tag{44}\\
&\le (n-l+2)^{|\mathcal{X}|} \sum_{P^{n-l}} \exp\Big\{-(n-l+1) \inf_q [D(q\|p_x) + |R_x - H(q)|^+]\Big\} \tag{45}\\
&\le (n-l+2)^{2|\mathcal{X}|} \exp\{-(n-l+1)E_{UN}(R_x)\}. \tag{46}
\end{align}

In going from (42) to (43), first note that the argument of the inner-most summation (over $\tilde{x}_l^n$) does not depend on $x$. We then use the following relations: (i) $\sum_{\tilde{x}_l^n \in T_{\tilde{P}^{n-l}}} 1 = |T_{\tilde{P}^{n-l}}| \le \exp\{(n-l+1)H(\tilde{P}^{n-l})\}$, which is a standard bound on the size of the type class, (ii) $H(\tilde{P}^{n-l}) < H(P^{n-l})$ by the minimum-suffix-entropy decoding rule, and (iii) the polynomial bound on the number of types, $|\{\tilde{P}^{n-l}\}| \le (n-l+2)^{|\mathcal{X}|}$. In (44) we recall the function definition $|\cdot|^+ = \max\{0, \cdot\}$. We pull the polynomial term out of the minimization and use $p_x(x_l^n) = \exp\{-(n-l+1)[D(P^{n-l}\|p_x) + H(P^{n-l})]\}$ for all $x_l^n \in T_{P^{n-l}}$. It is also in (44) that we see why we use a minimum suffix-entropy decoding rule instead of a minimum entropy decoding rule. If we had not marginalized out over $x^{l-1}$ in (41), then we would have a polynomial term out front in terms of $n$ rather than $n - l$, which for large $n$ could dominate the exponential decay in $n - l$. As the expression in (45) no longer depends on $x_l^n$, we simplify by using $|T_{P^{n-l}}| \le \exp\{(n-l+1)H(P^{n-l})\}$. In (46) we use the definition of the universal error exponent $E_{UN}(R_x)$ from (10) of Theorem 3 and the polynomial bound on the number of types. ■
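The polynomial bound on the number of types used in step (iii) can be checked directly for small cases. In this sketch (function names are ours), `m` stands for the suffix length $n - l + 1$, so the bound $(n - l + 2)^{|\mathcal{X}|}$ becomes $(m + 1)^{|\mathcal{X}|}$; the exact count is the standard stars-and-bars formula.

```python
from math import comb

def number_of_types(m, alphabet_size):
    """Exact number of empirical distributions (types) of length-m sequences."""
    return comb(m + alphabet_size - 1, alphabet_size - 1)

def polynomial_bound(m, alphabet_size):
    """Upper bound (m + 1)^{|X|}: each symbol count lies in {0, ..., m}."""
    return (m + 1) ** alphabet_size
```

For a binary alphabet the exact count is $m + 1$, so the bound is loose by a factor of $m + 1$; this looseness only affects the polynomial prefactor, not the exponent.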
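Lemma 3 and $\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}] \le \sum_{l=1}^{n-\Delta} p_n(l)$ imply that:

\begin{align}
\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}] &\le \sum_{l=1}^{n-\Delta} (n-l+2)^{2|\mathcal{X}|} \exp\{-(n-l+1)E_{UN}(R_x)\} \notag\\
&\le \sum_{l=1}^{n-\Delta} K_1 \exp\{-(n-l+1)[E_{UN}(R_x) - \gamma]\} \tag{47}\\
&\le K_2 \exp\{-\Delta[E_{UN}(R_x) - \gamma]\}. \tag{48}
\end{align}

In (47) we incorporate the polynomial into the exponent. Namely, for all $a > 0$, $b > 0$, there exists a $C$ such that $z^a \le C\exp\{b(z-1)\}$ for all $z \ge 1$.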
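One explicit constant for this inequality can be obtained by elementary calculus. The particular value of $C$ below is an illustration rather than a choice made in the original argument. The function $z \mapsto z^a e^{-bz}$ on $z > 0$ is maximized at $z = a/b$, where it equals $(a/(eb))^a$. Hence

$$z^a \le \Big(\frac{a}{eb}\Big)^a e^{bz} = \Big(\frac{a}{eb}\Big)^a e^{b}\, e^{b(z-1)},$$

so $C = (a/(eb))^a e^b$ works for every $z > 0$, and in particular for all $z \ge 1$.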

We then make explicit the delay-dependent term. Pulling out the exponent in $\Delta$, the remaining summation is a sum over decaying exponentials, and can be bounded by a constant. Together with $K_1$, this gives the constant $K_2$ in (48). This proves Theorem 3. Note that the $\gamma$ in (48) does not enter the optimization because $\gamma > 0$ can be picked equal to any constant. The choice of $\gamma$ affects the constant $K$ in Theorem 3.
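The exponents appearing in Lemma 2 and Lemma 3 can be compared numerically for a concrete source. The sketch below is only an illustration: it assumes a hypothetical binary pmf `P`, uses the Gallager-style form in (35) maximized over $\rho \in [0,1]$ for the ML exponent, uses the divergence form in (45) for the universal exponent, and approximates both optimizations by grid search in nats.

```python
import math

P = (0.1, 0.9)  # hypothetical binary source pmf

def e_ml(rate, steps=2000):
    """sup over rho in [0,1] of rho*R - (1+rho)*ln(sum_x p(x)^(1/(1+rho)))."""
    best = 0.0
    for i in range(steps + 1):
        rho = i / steps
        s = 1.0 / (1.0 + rho)
        val = rho * rate - (1.0 + rho) * math.log(sum(p ** s for p in P))
        best = max(best, val)
    return best

def e_un(rate, steps=20000):
    """inf over binary q of D(q||p) + |R - H(q)|^+, on a grid that avoids q = 0 and q = 1."""
    best = float("inf")
    for i in range(1, steps):
        q = (i / steps, 1.0 - i / steps)
        div = sum(qi * math.log(qi / pi) for qi, pi in zip(q, P))
        ent = -sum(qi * math.log(qi) for qi in q)
        best = min(best, div + max(0.0, rate - ent))
    return best
```

For rates above the source entropy the two grid-search values should agree up to discretization error, in line with the statement that ML and universal decoding achieve the same exponent.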
VI. Streaming source coding with side information at the decoder

If a random sequence $y^n$, related to the source $x^n$ through a discrete memoryless channel, is observed at the decoder, then this side information can be used to reduce the rate of the source code. In this model $p_{xy}(x^n, y^n) = \prod_{i=1}^{n} p_{xy}(x_i, y_i) = \prod_{i=1}^{n} p_{x|y}(x_i|y_i)\, p_y(y_i)$. The source $x^n$ is observed at the encoder, and the decoder, which observes $y^n$ and a bit stream from the encoder, wants to estimate each source symbol $x_i$ with a probability of error that decreases exponentially in the decoding delay $\Delta$.

We can apply the analysis of Section V to this problem with a few minor modifications. For ML decoding, we need to pick the sequence with the maximum conditional probability given $y^n$. The error exponent can be derived using a Chernoff bounding argument similar to that in Section V. For universal decoding, the only change is that we now use a minimum suffix conditional-entropy decoder that compares sequence pairs $(\tilde{x}^n, y^n)$ and $(x^n, y^n)$. In terms of the analysis, one change enters in (25), where we must also sum over the possible side information sequences. And in (42) the entropy condition in the summation over $\tilde{x}$ changes to $H(\tilde{x}_l^n | y_l^n) < H(x_l^n | y_l^n)$ (or the equivalent type notation). Since there is no ambiguity in the side information, because $y^n$ is observed at the decoder, this condition is equivalent to $H(\tilde{x}_l^n, y_l^n) < H(x_l^n, y_l^n)$.

These results are summarized in Theorems 4 and 5. We do not include the full derivation of these theorems as no new ideas are required.



VII. Streaming Slepian-Wolf source coding

In this section we provide the proofs of Theorems 6 and 7, which consider the two-user¹⁴ Slepian-Wolf problem. As with the proofs of Theorems 2 and 3 in Sections V-A and V-C, we start by developing the common core of the proof in the context of maximum likelihood decoding. This allows us to develop the results for universal decoding more quickly and transparently. Furthermore, as shown in Theorem 8, maximum likelihood decoding and universal decoding provide the same reliability with delay.

A. Maximum Likelihood Decoding

In Theorems 6 and 7 three error events are considered: (i) $\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}]$, (ii) $\Pr[\hat{y}^{n-\Delta} \ne y^{n-\Delta}]$, and (iii) $\Pr[(\hat{x}^{n-\Delta}, \hat{y}^{n-\Delta}) \ne (x^{n-\Delta}, y^{n-\Delta})]$. We develop the error exponent for case (i). The error exponent for case (ii) follows from a similar derivation, and that of case (iii) from an application of the union bound, resulting in an exponent that is the minimum of the exponents of cases (i) and (ii).

To lead to the decoding error $\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}]$ there must be some spurious source pair $(\tilde{x}^n, \tilde{y}^n)$ that satisfies three conditions: (i) $\tilde{x}^n \in \mathcal{B}_x(x^n)$ and $\tilde{y}^n \in \mathcal{B}_y(y^n)$, (ii) it must be more likely than the true pair, $p_{xy}(\tilde{x}^n, \tilde{y}^n) > p_{xy}(x^n, y^n)$, and (iii) $\tilde{x}_l \ne x_l$ for some $l \le n - \Delta$.

The error probability is

\begin{align}
\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}] &= \sum_{x^n, y^n} \Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta} \mid \mathsf{x}^n = x^n, \mathsf{y}^n = y^n]\, p_{xy}(x^n, y^n) \notag\\
&\le \sum_{x^n, y^n} \Big\{ \sum_{l=1}^{n-\Delta} \sum_{k=1}^{n+1} p_{xy}(x^n, y^n) \Pr[\exists\, (\tilde{x}^n, \tilde{y}^n) \in \mathcal{B}_x(x^n) \times \mathcal{B}_y(y^n) \cap \mathcal{F}_n(l, k, x^n, y^n) \text{ s.t. } p_{xy}(\tilde{x}^n, \tilde{y}^n) > p_{xy}(x^n, y^n)] \Big\} \tag{49}\\
&= \sum_{l=1}^{n-\Delta} \sum_{k=1}^{n+1} \Big\{ \sum_{x^n, y^n} p_{xy}(x^n, y^n) \Pr[\exists\, (\tilde{x}^n, \tilde{y}^n) \in \mathcal{B}_x(x^n) \times \mathcal{B}_y(y^n) \cap \mathcal{F}_n(l, k, x^n, y^n) \text{ s.t. } p_{xy}(\tilde{x}^n, \tilde{y}^n) > p_{xy}(x^n, y^n)] \Big\} \notag\\
&= \sum_{l=1}^{n-\Delta} \sum_{k=1}^{n+1} p_n(l, k). \tag{50}
\end{align}

¹⁴ The multiuser case is essentially the same, just with a lot more notation and minimization parameters $\gamma_1, \gamma_2, \ldots$.
In (49) we decompose the error event into a number of mutually exclusive events by partitioning all source pairs $(\tilde{x}^n, \tilde{y}^n)$ into sets $\mathcal{F}_n(l, k, x^n, y^n)$ defined by the times $l$ and $k$ at which $\tilde{x}^n$ and $\tilde{y}^n$ diverge from the realized source sequences. The set $\mathcal{F}_n(l, k, x^n, y^n)$ is defined as

$$\mathcal{F}_n(l, k, x^n, y^n) = \{(\tilde{x}^n, \tilde{y}^n) \in \mathcal{X}^n \times \mathcal{Y}^n \text{ s.t. } \tilde{x}^{l-1} = x^{l-1},\ \tilde{x}_l \ne x_l,\ \tilde{y}^{k-1} = y^{k-1},\ \tilde{y}_k \ne y_k\}. \tag{51}$$

In contrast to streaming point-to-point or side-information coding (cf. (51) with (28)), the partition is now doubly-indexed. To find the dominant error event, we must search over both indices. Having two dimensions to search over results in an extra minimization when calculating the error exponent (and leads to the infimum over $\gamma$ in Theorem 6).
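The double index of the partition in (51) is simply the pair of first-divergence times. A small sketch (helper names are ours) that maps a candidate pair to its indices $(l, k)$, using 1-based times and the convention that a sequence identical to the truth gets index $n + 1$:

```python
def first_difference(candidate, truth):
    """1-based index of the first differing symbol, or n + 1 if the sequences agree."""
    for i, (c, t) in enumerate(zip(candidate, truth), start=1):
        if c != t:
            return i
    return len(truth) + 1

def partition_index(x_cand, y_cand, x_true, y_true):
    """Indices (l, k) of the set F_n(l, k, x^n, y^n) that contains the candidate pair."""
    return first_difference(x_cand, x_true), first_difference(y_cand, y_true)
```

For example, a candidate whose x sequence first deviates at time 3 while its y sequence matches the truth over all $n = 4$ symbols falls in the set indexed by $(3, 5)$.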

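Finally, to get (50) we define $p_n(l, k)$ as

$$p_n(l, k) = \sum_{x^n, y^n} p_{xy}(x^n, y^n) \Pr[\exists\, (\tilde{x}^n, \tilde{y}^n) \in \mathcal{B}_x(x^n) \times \mathcal{B}_y(y^n) \cap \mathcal{F}_n(l, k, x^n, y^n) \text{ s.t. } p_{xy}(\tilde{x}^n, \tilde{y}^n) > p_{xy}(x^n, y^n)].$$

The following lemma provides an upper bound on $p_n(l, k)$:

Lemma 4:

\begin{align}
p_n(l, k) &\le \exp\Big\{-(n - l + 1) E_x\Big(R_x, R_y, \frac{k - l}{n - l + 1}\Big)\Big\} \quad \text{if } l \le k, \notag\\
p_n(l, k) &\le \exp\Big\{-(n - k + 1) E_y\Big(R_x, R_y, \frac{l - k}{n - k + 1}\Big)\Big\} \quad \text{if } l \ge k, \tag{52}
\end{align}

where $E_x(R_x, R_y, \gamma)$ and $E_y(R_x, R_y, \gamma)$ are defined in (13) and (14) respectively. Notice that $l, k \le n$; for $l \le k$, $\frac{k - l}{n - l + 1} \in [0, 1]$ serves as $\gamma$ in the error exponent $E_x(R_x, R_y, \gamma)$. Similarly for $l \ge k$.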
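Proof: The bound depends on whether $l \le k$ or $l \ge k$. Consider the case $l \le k$:

\begin{align}
p_n(l, k) &= \sum_{x^n, y^n} p_{xy}(x^n, y^n) \Pr[\exists\, (\tilde{x}^n, \tilde{y}^n) \in \mathcal{B}_x(x^n) \times \mathcal{B}_y(y^n) \cap \mathcal{F}_n(l, k, x^n, y^n) \text{ s.t. } p_{xy}(x^n, y^n) < p_{xy}(\tilde{x}^n, \tilde{y}^n)] \notag\\
&\le \sum_{x^n, y^n} \min\Big[1, \sum_{\substack{(\tilde{x}^n, \tilde{y}^n) \in \mathcal{F}_n(l, k, x^n, y^n)\\ p_{xy}(x^n, y^n) < p_{xy}(\tilde{x}^n, \tilde{y}^n)}} \Pr[\tilde{x}^n \in \mathcal{B}_x(x^n),\ \tilde{y}^n \in \mathcal{B}_y(y^n)]\Big]\, p_{xy}(x^n, y^n) \tag{53}\\
&\le \sum_{x_l^n, y_l^n} \min\Big[1, \sum_{\substack{\tilde{x}_l^n, \tilde{y}_k^n \text{ s.t.}\\ p_{xy}(x_l^n, y_l^n) < p_{xy}(\tilde{x}_l^n, \tilde{y}_l^n)}} \exp\{-(n-l+1)R_x - (n-k+1)R_y\}\Big]\, p_{xy}(x_l^n, y_l^n) \tag{54}\\
&= \sum_{x_l^n, y_l^n} \min\Big[1, \sum_{\tilde{x}_l^n, \tilde{y}_k^n} \exp\{-(n-l+1)R_x - (n-k+1)R_y\}\, 1\big[p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})\, p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n) > p_{xy}(x_l^n, y_l^n)\big]\Big]\, p_{xy}(x_l^n, y_l^n) \notag\\
&\le \sum_{x_l^n, y_l^n} \min\Big[1, \sum_{\tilde{x}_l^n, \tilde{y}_k^n} \exp\{-(n-l+1)R_x - (n-k+1)R_y\} \min\Big[1, \frac{p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})\, p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n)}{p_{xy}(x_l^n, y_l^n)}\Big]\Big]\, p_{xy}(x_l^n, y_l^n) \notag\\
&\le \sum_{x_l^n, y_l^n} \Big[\sum_{\tilde{x}_l^n, \tilde{y}_k^n} e^{-(n-l+1)R_x - (n-k+1)R_y} \Big[\frac{p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})\, p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n)}{p_{xy}(x_l^n, y_l^n)}\Big]^{\frac{1}{1+\rho}}\Big]^{\rho}\, p_{xy}(x_l^n, y_l^n) \tag{55}\\
&= e^{-(n-l+1)\rho R_x - (n-k+1)\rho R_y} \sum_{x_l^n, y_l^n} p_{xy}(x_l^n, y_l^n)^{\frac{1}{1+\rho}} \Big[\sum_{\tilde{x}_l^n, \tilde{y}_k^n} \big[p_{xy}(\tilde{x}_l^{k-1}, y_l^{k-1})\, p_{xy}(\tilde{x}_k^n, \tilde{y}_k^n)\big]^{\frac{1}{1+\rho}}\Big]^{\rho} \notag\\
&= e^{-(n-l+1)\rho R_x - (n-k+1)\rho R_y} \sum_{y_l^{k-1}} \Big[\sum_{x_l^{k-1}} p_{xy}(x_l^{k-1}, y_l^{k-1})^{\frac{1}{1+\rho}}\Big]^{1+\rho} \Big[\sum_{x_k^n, y_k^n} p_{xy}(x_k^n, y_k^n)^{\frac{1}{1+\rho}}\Big]^{1+\rho} \notag\\
&= e^{-(n-l+1)\rho R_x - (n-k+1)\rho R_y} \Big[\sum_y \Big[\sum_x p_{xy}(x, y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]^{k-l} \Big[\sum_{x, y} p_{xy}(x, y)^{\frac{1}{1+\rho}}\Big]^{(1+\rho)(n-k+1)} \tag{56}\\
&= \exp\Big\{-(k-l)\Big[\rho R_x - \log \sum_y \Big[\sum_x p_{xy}(x, y)^{\frac{1}{1+\rho}}\Big]^{1+\rho}\Big]\Big\} \exp\Big\{-(n-k+1)\Big[\rho(R_x + R_y) - (1+\rho) \log \sum_{x, y} p_{xy}(x, y)^{\frac{1}{1+\rho}}\Big]\Big\} \notag\\
&= \exp\{-(k-l) E_{x|y}(R_x, \rho) - (n-k+1) E_{xy}(R_x, R_y, \rho)\} \tag{57}\\
&= \exp\Big\{-(n-l+1)\Big[\frac{k-l}{n-l+1} E_{x|y}(R_x, \rho) + \frac{n-k+1}{n-l+1} E_{xy}(R_x, R_y, \rho)\Big]\Big\} \tag{58}\\
&\le \exp\Big\{-(n-l+1) \sup_{\rho \in [0,1]} \Big[\frac{k-l}{n-l+1} E_{x|y}(R_x, \rho) + \frac{n-k+1}{n-l+1} E_{xy}(R_x, R_y, \rho)\Big]\Big\} \tag{59}\\
&= \exp\Big\{-(n-l+1) E_x^{ML}\Big(R_x, R_y, \frac{k-l}{n-l+1}\Big)\Big\} = \exp\Big\{-(n-l+1) E_x\Big(R_x, R_y, \frac{k-l}{n-l+1}\Big)\Big\}. \tag{60}
\end{align}

In (53) we explicitly indicate the three conditions that a suffix pair $(\tilde{x}_l^n, \tilde{y}_k^n)$ must satisfy to result in a decoding error. In (54) we sum out over the common prefixes $(x^{l-1}, y^{l-1})$, and use the fact that the random binning is done independently at each encoder, see Definition 2. Here $\tilde{y}_l^{k-1} = y_l^{k-1}$, since the y sequences first differ at time $k$. We get (55) by limiting $\rho$ to the interval $0 \le \rho \le 1$, as in (33). Getting (56) from (55) follows by a number of basic manipulations. In (56) we get the single letter expression by again using the memoryless property of the sources. In (57) we use the definitions of $E_{x|y}$ and $E_{xy}$ from (14) of Theorem 6. Noting that the bound holds for all $\rho \in [0, 1]$, optimizing over $\rho$ results in (59). Finally, using the definition in (13) and the remark following Theorem 8 that the maximum-likelihood and universal exponents are equal gives (60). The bound on $p_n(l, k)$ when $l > k$ is developed in an analogous fashion. ■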

We use Lemma 4 together with (50) to bound $\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}]$ for two distinct cases. The first, simpler case is when $\inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma) > \inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma)$. To bound $\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}]$ in this case, we split the sum over the $p_n(l, k)$ into two terms, as visualized in Fig. 9. There are $(n + 1) \times (n - \Delta)$ such events to account for (those inside the box). The probabilities of the events within each oval are summed together to give an upper bound on $\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}]$. We add extra probabilities outside of the box but within the ovals to make the summation symmetric and thus simpler. Those extra error events do not impact the error exponent because $\inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma) \ge \inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma)$. The possible dominant error events are highlighted in Figure 9. Thus,

\begin{align}
\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}] &\le \sum_{l=1}^{n-\Delta} \sum_{k=l}^{n+1} p_n(l, k) + \sum_{k=1}^{n-\Delta} \sum_{l=k}^{n+1} p_n(l, k) \tag{61}\\
&\le \sum_{l=1}^{n-\Delta} \sum_{k=l}^{n+1} \exp\Big\{-(n-l+1) \inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma)\Big\} + \sum_{k=1}^{n-\Delta} \sum_{l=k}^{n+1} \exp\Big\{-(n-k+1) \inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma)\Big\} \tag{62}\\
&= \sum_{l=1}^{n-\Delta} (n-l+2) \exp\Big\{-(n-l+1) \inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma)\Big\} + \sum_{k=1}^{n-\Delta} (n-k+2) \exp\Big\{-(n-k+1) \inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma)\Big\} \notag\\
&\le 2 \sum_{l=1}^{n-\Delta} (n-l+2) \exp\Big\{-(n-l+1) \inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma)\Big\} \tag{63}\\
&\le \sum_{l=1}^{n-\Delta} C_1 \exp\Big\{-(n-l+2)\Big[\inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma) - \alpha\Big]\Big\} \tag{64}\\
&\le C_2 \exp\Big\{-\Delta\Big[\inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma) - \alpha\Big]\Big\}. \tag{65}
\end{align}

Equation (61) follows directly from (50): in the first term $l \le k$, in the second term $l \ge k$. In (62), we use Lemma 4. In (63) we use the assumption that $\inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma) > \inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma)$. In (64) the $\alpha > 0$ results from incorporating the polynomial into the first exponent, and can be chosen as small as desired. Combining terms and summing out the decaying exponential yields the bound (65).

The second, more involved case is when $\inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma) < \inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma)$. To bound $\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}]$, we could use the same bounding technique used in the first case. This gives the error exponent $\inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma)$, which is generally smaller than what we can get by dividing the error events in a new scheme as shown in Figure 10. In this situation we split (50) into three terms, as visualized in Fig. 10. Just as in the first case shown in Fig. 9, there are $(n + 1) \times (n - \Delta)$ such events to account for (those inside the box). The error events are partitioned into 3 regions. Regions 2 and 3 are separated by $k^*(l)$ using a dotted line. In region 3, we add extra probabilities outside of the box but within the ovals to make the summation simpler. Those extra error events do not affect the error exponent, as shown in the proof. The possible dominant error events are highlighted in Fig. 10. Thus,

$$\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}] \le \sum_{l=1}^{n-\Delta} \sum_{k=l}^{n+1} p_n(l, k) + \sum_{l=1}^{n-\Delta} \sum_{k=k^*(l)}^{l-1} p_n(l, k) + \sum_{l=1}^{n-\Delta} \sum_{k=1}^{k^*(l)-1} p_n(l, k), \tag{66}$$

where empty sums are zero, i.e., $\sum_{k=1}^{0} p_k = 0$. The lower boundary of Region 2 is $k^*(l) \ge 1$ as a function of $n$ and $l$:

$$k^*(l) = \max\Big\{1,\ n + 1 - \Big\lceil \frac{\inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma)}{\inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma)} \Big\rceil (n + 1 - l)\Big\} = \max\{1,\ n + 1 - G(n + 1 - l)\}, \tag{67}$$

where we use $G$ to denote the ceiling of the ratio of exponents. Note that when $\inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma) > \inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma)$ then $G = 1$ and region two of Fig. 10 disappears. In other words, the middle term of (66) equals zero. This is the first case considered. We now consider the cases when $G \ge 2$ (because of the ceiling function $G$ is a positive integer).
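The boundary $k^*(l)$ in (67) and the resulting three-way split of the error events $(l, k)$ in (66) are straightforward to tabulate. The sketch below uses our own function names and takes the two infimum exponents as given numbers; it is meant only to make the geometry of the regions concrete.

```python
import math

def ceiling_ratio(inf_e_x, inf_e_y):
    """G: ceiling of the ratio of the two infimum exponents (both assumed positive)."""
    return math.ceil(inf_e_x / inf_e_y)

def k_star(n, l, g):
    """Lower boundary of region two, k*(l) = max{1, n + 1 - G (n + 1 - l)}."""
    return max(1, n + 1 - g * (n + 1 - l))

def region_of(n, l, k, g):
    """Region label (1, 2 or 3) of the error event (l, k) in the split of (66)."""
    if k >= l:
        return 1                      # first term: l <= k
    return 2 if k >= k_star(n, l, g) else 3
```

With $G = 1$ the boundary coincides with the diagonal, $k^*(l) = l$, so no event is labeled as region two, matching the first case above.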

The first term of (66), i.e., region one in Fig. 10 where $l \le k$, is bounded in the same way that the first term of (61) is, giving

$$\sum_{l=1}^{n-\Delta} \sum_{k=l}^{n+1} p_n(l, k) \le C_2 \exp\Big\{-\Delta\Big[\inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma) - \alpha\Big]\Big\}. \tag{68}$$

Fig. 9. Two dimensional plot of the error probabilities $p_n(l, k)$, corresponding to error events $(l, k)$, contributing to $\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}]$ in the situation where $\inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma) \ge \inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma)$.

In Fig. 10, region two is upper bounded by the 45-degree line, and lower bounded by $k^*(l)$. The second term of (66), corresponding to this region where $l \ge k$, is bounded as

\begin{align}
\sum_{l=1}^{n-\Delta} \sum_{k=k^*(l)}^{l-1} p_n(l, k) &\le \sum_{l=1}^{n-\Delta} \sum_{k=k^*(l)}^{l-1} \exp\Big\{-(n-k+1) E_y\Big(R_x, R_y, \frac{l-k}{n-k+1}\Big)\Big\} \notag\\
&= \sum_{l=1}^{n-\Delta} \sum_{k=k^*(l)}^{l-1} \exp\Big\{-(n-l+1) \frac{n-k+1}{n-l+1} E_y\Big(R_x, R_y, \frac{l-k}{n-k+1}\Big)\Big\} \tag{69}\\
&\le \sum_{l=1}^{n-\Delta} \sum_{k=k^*(l)}^{l-1} \exp\Big\{-(n-l+1) \inf_{\gamma \in [0,1]} \frac{1}{1-\gamma} E_y(R_x, R_y, \gamma)\Big\} \tag{70}\\
&= \sum_{l=1}^{n-\Delta} (l - k^*(l)) \exp\Big\{-(n-l+1) \inf_{\gamma \in [0,1]} \frac{1}{1-\gamma} E_y(R_x, R_y, \gamma)\Big\}. \tag{71}
\end{align}

In (69) we note that $l \ge k$, so we define $\frac{l-k}{n-k+1} = \gamma$ as in (70). Then $\frac{n-k+1}{n-l+1} = \frac{1}{1-\gamma}$.

Fig. 10. Two dimensional plot of the error probabilities $p_n(l, k)$, corresponding to error events $(l, k)$, contributing to $\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}]$ in the situation where $\inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma) < \inf_{\gamma \in [0,1]} E_x(R_x, R_y, \gamma)$.

The third term of (66), i.e., the intersection of region three and the "box" in Fig. 10 where $l \ge k$, can be bounded as

\begin{align}
\sum_{l=1}^{n-\Delta} \sum_{k=1}^{k^*(l)-1} p_n(l, k) &\le \sum_{l=1}^{n+1} \sum_{k=1}^{\min\{l,\, k^*(n-\Delta)-1\}} p_n(l, k) \tag{72}\\
&= \sum_{k=1}^{k^*(n-\Delta)-1} \sum_{l=k}^{n+1} p_n(l, k) \tag{73}\\
&\le \sum_{k=1}^{k^*(n-\Delta)-1} \sum_{l=k}^{n+1} \exp\Big\{-(n-k+1) E_y\Big(R_x, R_y, \frac{l-k}{n-k+1}\Big)\Big\} \notag\\
&\le \sum_{k=1}^{k^*(n-\Delta)-1} \sum_{l=k}^{n+1} \exp\Big\{-(n-k+1) \inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma)\Big\} \notag\\
&\le \sum_{k=1}^{k^*(n-\Delta)-1} (n-k+2) \exp\Big\{-(n-k+1) \inf_{\gamma \in [0,1]} E_y(R_x, R_y, \gamma)\Big\}. \tag{74}
\end{align}

In (72) we note that $l \le n - \Delta$, thus $k^*(n - \Delta) - 1 \ge k^*(l) - 1$; also $l \ge 1$, so $l \ge k^*(l) - 1$. This can be visualized in Fig. 10 as we extend the summation from the intersection of the "box" and region 3 to the whole region under the diagonal line and the horizontal line $k = k^*(n - \Delta) - 1$. In (73) we simply switch the order of the summation.
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Finally when G > 2, we substitute ( I68> . ( 17 U . and d74l into ( I66l l to give 

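\begin{align}
\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}]
&\le C_2\exp\Big\{-\Delta\Big[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\alpha\Big]\Big\} \notag\\
&\quad + \sum_{l=1}^{n-\Delta}\big(l-k^*(l)\big)\exp\Big\{-(n-l+1)\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)\Big\} \tag{75}\\
&\quad + \sum_{k=1}^{k^*(n-\Delta)-1}(n-k+2)\exp\Big\{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)\Big\} \notag\\
&\le C_2\exp\Big\{-\Delta\Big[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\alpha\Big]\Big\} \notag\\
&\quad + \sum_{l=1}^{n-\Delta}\big(l-n-1+G(n+1-l)\big)\exp\Big\{-(n-l+1)\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)\Big\} \tag{76}\\
&\quad + \sum_{k=1}^{n+1-G(\Delta+1)}(n-k+2)\exp\Big\{-(n-k+1)\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)\Big\} \notag\\
&\le C_2\exp\Big\{-\Delta\Big[\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma)-\alpha\Big]\Big\} \notag\\
&\quad + (G-1)\,C_3\exp\Big\{-\Delta\Big[\inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)-\alpha\Big]\Big\} \notag\\
&\quad + C_4\exp\Big\{-\Big[\Delta G\inf_{\gamma\in[0,1]}E_y(R_x,R_y,\gamma)-\alpha\Big]\Big\} \notag\\
&\le C_5\exp\Big\{-\Delta\Big[\min\Big\{\inf_{\gamma\in[0,1]}E_x(R_x,R_y,\gamma),\ \inf_{\gamma\in[0,1]}\frac{1}{1-\gamma}E_y(R_x,R_y,\gamma)\Big\}-\alpha\Big]\Big\}. \tag{77}
\end{align}

To get (76), we use the fact that $k^*(l) \ge n+1-G(n+1-l)$, from the definition of $k^*(l)$ in (67), to upper bound the second term. We exploit the definition of $G$ to convert the exponent in the third term to $\inf_{\gamma\in[0,1]} E_x(R_x,R_y,\gamma)$. Finally, to get (77) we gather the constants together, sum out over the decaying exponentials, and are limited by the smaller of the two exponents.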
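The step of "summing out over the decaying exponentials" can be checked numerically. The sketch below is only an illustration: the exponent `E`, the slack `EPS`, and the helper names are assumptions, not quantities from the paper. It sums terms of the form $(m+1)e^{-mE}$, the shape of the summands in (74) after substituting $m = n-k+1$, and finds a constant that dominates the tail for every starting index tried.

```python
import math

def tail_sum(start, exponent, terms=2000):
    """Sum of (m + 1) * exp(-m * exponent) for m = start, ..., start + terms - 1."""
    return sum((m + 1) * math.exp(-m * exponent) for m in range(start, start + terms))

def smallest_constant(exponent, eps, max_delay=150):
    """Smallest C with tail_sum(d) <= C * exp(-d * (exponent - eps)) for d = 1..max_delay."""
    return max(tail_sum(d, exponent) * math.exp(d * (exponent - eps))
               for d in range(1, max_delay + 1))

E, EPS = 0.5, 0.05   # illustrative exponent and slack (assumed values)
C = smallest_constant(E, EPS)
```

The polynomial prefactor costs only an arbitrarily small slack in the exponent, which is why the constants can be gathered into a single $C_5$.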

Note: in the proof of Theorem 6 we regularly double count the error events or add smaller extra probabilities to make the summations simpler. But it should be clear that the error exponent is not affected.

B. Universal Decoding 

As discussed in Section V-C, we do not use a pairwise minimum joint-entropy decoder because a polynomial term in $n$ would multiply the exponential decay in $\Delta$. Analogous to the sequential decoder used there, we use a "weighted suffix entropy" decoder. The decoding starts by first identifying candidate sequence pairs as those that agree with the encoding bit streams up to time $n$, i.e., $\bar{x}^n \in B_x(x^n)$, $\bar{y}^n \in B_y(y^n)$. For any one of the $|B_x(x^n)||B_y(y^n)|$ sequence pairs in the candidate set, i.e., $(\bar{x}^n,\bar{y}^n) \in B_x(x^n) \times B_y(y^n)$, we compute $(n+1)\times(n+1)$ weighted entropies:

\begin{align*}
H_S(l,k,\bar{x}^n,\bar{y}^n) &= H(\bar{x}_l^n,\bar{y}_l^n), && l = k\\
H_S(l,k,\bar{x}^n,\bar{y}^n) &= \frac{k-l}{n+1-l}H(\bar{x}_l^{k-1}|\bar{y}_l^{k-1}) + \frac{n+1-k}{n+1-l}H(\bar{x}_k^n,\bar{y}_k^n), && l < k\\
H_S(l,k,\bar{x}^n,\bar{y}^n) &= \frac{l-k}{n+1-k}H(\bar{y}_k^{l-1}|\bar{x}_k^{l-1}) + \frac{n+1-l}{n+1-k}H(\bar{x}_l^n,\bar{y}_l^n), && l > k.
\end{align*}

We define the score of $(\bar{x}^n,\bar{y}^n)$ as the pair of integers $i_x(\bar{x}^n,\bar{y}^n)$, $i_y(\bar{x}^n,\bar{y}^n)$ s.t.

\begin{align}
i_x(\bar{x}^n,\bar{y}^n) = \max\{i :\ & H_S(l,k,(\bar{x}^n,\bar{y}^n)) < H_S(l,k,\tilde{x}^n,\tilde{y}^n)\ \ \forall k = 1,2,\ldots,n+1,\ \forall l = 1,2,\ldots,i, \notag\\
& \forall(\tilde{x}^n,\tilde{y}^n) \in B_x(x^n)\times B_y(y^n)\cap\mathcal{F}_n(l,k,\bar{x}^n,\bar{y}^n)\} \tag{78}\\
i_y(\bar{x}^n,\bar{y}^n) = \max\{i :\ & H_S(l,k,(\bar{x}^n,\bar{y}^n)) < H_S(l,k,\tilde{x}^n,\tilde{y}^n)\ \ \forall l = 1,2,\ldots,n+1,\ \forall k = 1,2,\ldots,i, \notag\\
& \forall(\tilde{x}^n,\tilde{y}^n) \in B_x(x^n)\times B_y(y^n)\cap\mathcal{F}_n(l,k,\bar{x}^n,\bar{y}^n)\} \tag{79}
\end{align}

Fig. 11. 2D interpretation of the score, $(i_x(\bar{x}^n,\bar{y}^n), i_y(\bar{x}^n,\bar{y}^n))$, of a sequence pair $(\bar{x}^n,\bar{y}^n)$. If there exists a sequence pair in $\mathcal{F}_n(l,k,\bar{x}^n,\bar{y}^n)$ with less or the same score, then $(l,k)$ is marked with a solid dot. The score $i_x(\bar{x}^n,\bar{y}^n)$ is the largest integer which is smaller than all the x-coordinates of the marked points. Similarly for $i_y(\bar{x}^n,\bar{y}^n)$.

While $\mathcal{F}_n(l,k,x^n,y^n)$ is the same set as defined in (51), we repeat the definition here for convenience,

$$\mathcal{F}_n(l,k,x^n,y^n) = \{(\bar{x}^n,\bar{y}^n) \in \mathcal{X}^n\times\mathcal{Y}^n \ \text{s.t.}\ \bar{x}^{l-1} = x^{l-1},\ \bar{x}_l \ne x_l,\ \bar{y}^{k-1} = y^{k-1},\ \bar{y}_k \ne y_k\}.$$

The definition of $(i_x(\bar{x}^n,\bar{y}^n), i_y(\bar{x}^n,\bar{y}^n))$ can be visualized in the following procedure. As shown in Fig. 11, for all $1 \le l,k \le n+1$, if there exists $(\tilde{x}^n,\tilde{y}^n) \in \mathcal{F}_n(l,k,(\bar{x}^n,\bar{y}^n)) \cap B_x(x^n)\times B_y(y^n)$ s.t. $H_S(l,k,\bar{x}^n,\bar{y}^n) \ge H_S(l,k,\tilde{x}^n,\tilde{y}^n)$, then we mark $(l,k)$ on the plane as shown in Fig. 11. Eventually we pick the maximum integer which is smaller than all marked x-coordinates as $i_x(\bar{x}^n,\bar{y}^n)$ and the maximum integer which is smaller than all marked y-coordinates as $i_y(\bar{x}^n,\bar{y}^n)$. The score of $(\bar{x}^n,\bar{y}^n)$ tells us the first branch point (either x or y) where a "better sequence pair" (with a smaller weighted entropy) exists.

Define the set of the winners as the sequences (not sequence pairs) with the maximum score:

\begin{align*}
W_n^x &= \{\bar{x}^n \in B_x(x^n) : \exists\,\bar{y}^n \in B_y(y^n)\ \text{s.t.}\ i_x(\bar{x}^n,\bar{y}^n) \ge i_x(\tilde{x}^n,\tilde{y}^n),\ \forall(\tilde{x}^n,\tilde{y}^n) \in B_x(x^n)\times B_y(y^n)\}\\
W_n^y &= \{\bar{y}^n \in B_y(y^n) : \exists\,\bar{x}^n \in B_x(x^n)\ \text{s.t.}\ i_y(\bar{x}^n,\bar{y}^n) \ge i_y(\tilde{x}^n,\tilde{y}^n),\ \forall(\tilde{x}^n,\tilde{y}^n) \in B_x(x^n)\times B_y(y^n)\}
\end{align*}

Then arbitrarily pick one sequence from $W_n^x$ and one from $W_n^y$ as the decision $(\hat{x}^n,\hat{y}^n)$.
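The weighted suffix entropies $H_S(l,k,\cdot,\cdot)$ that drive the score can be computed directly from empirical counts. The following is a minimal sketch under stated assumptions: sequences are Python lists, entropies are in nats, indices $l,k$ are 1-based as in the text, and all function names are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter

def entropy(symbols):
    """Empirical entropy (nats) of a list of hashable symbols; 0 for an empty list."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log(c / n) for c in Counter(symbols).values())

def cond_entropy(a, b):
    """Empirical conditional entropy H(a|b) = H(a,b) - H(b)."""
    return entropy(list(zip(a, b))) - entropy(list(b))

def weighted_suffix_entropy(l, k, x, y):
    """H_S(l, k, x^n, y^n) for 1-based l, k in 1..n+1 (three cases as in the text)."""
    n = len(x)
    if l == k:
        return entropy(list(zip(x[l - 1:], y[l - 1:])))
    if l < k:
        head = cond_entropy(x[l - 1:k - 1], y[l - 1:k - 1])   # H(x_l^{k-1} | y_l^{k-1})
        tail = entropy(list(zip(x[k - 1:], y[k - 1:])))       # H(x_k^n, y_k^n)
        return ((k - l) * head + (n + 1 - k) * tail) / (n + 1 - l)
    head = cond_entropy(y[k - 1:l - 1], x[k - 1:l - 1])       # H(y_k^{l-1} | x_k^{l-1})
    tail = entropy(list(zip(x[l - 1:], y[l - 1:])))           # H(x_l^n, y_l^n)
    return ((l - k) * head + (n + 1 - l) * tail) / (n + 1 - k)
```

A decoder along the lines described above would evaluate this quantity for every candidate pair in $B_x(x^n)\times B_y(y^n)$ and every $(l,k)$; the sketch covers only the entropy computation, not the search over bins.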

We bound the probability that there exists a sequence pair in $\mathcal{F}_n(l,k,(x^n,y^n)) \cap B_x(x^n)\times B_y(y^n)$ with smaller weighted minimum-entropy suffix score as:

\begin{align*}
p_n(l,k) = \sum_{x^n}\sum_{y^n} p_{xy}(x^n,y^n)\Pr\big(&\exists(\tilde{x}^n,\tilde{y}^n) \in B_x(x^n)\times B_y(y^n)\cap\mathcal{F}_n(l,k,x^n,y^n),\\
&\text{s.t.}\ H_S(l,k,\tilde{x}^n,\tilde{y}^n) \le H_S(l,k,(x^n,y^n))\big)
\end{align*}

Note that the $p_n(l,k)$ here differs from the $p_n(l,k)$ defined in the ML decoding by replacing $p_{xy}(x^n,y^n) \le p_{xy}(\tilde{x}^n,\tilde{y}^n)$ with $H_S(l,k,\tilde{x}^n,\tilde{y}^n) \le H_S(l,k,(x^n,y^n))$.






The following lemma, analogous to (50) for ML decoding, tells us that the "suffix weighted entropy" decoding rule is a good one.

Lemma 5: Upper bound on symbol-wise decoding error P ex (k, k + d) : 



$$\Pr[\hat{x}^{n-\Delta} \ne x^{n-\Delta}] \le \sum_{l=1}^{n-\Delta}\sum_{k=1}^{n+1} p_n(l,k)$$

Proof: According to the decoding rule, $\hat{x}^{n-\Delta} \ne x^{n-\Delta}$ implies that there exists a sequence $\tilde{x}^n \in W_n^x$ s.t. $\tilde{x}^{n-\Delta} \ne x^{n-\Delta}$. This means that there exists a sequence $\tilde{y}^n \in B_y(y^n)$ s.t. $i_x(\tilde{x}^n,\tilde{y}^n) \ge i_x(x^n,y^n)$. Suppose that $(\tilde{x}^n,\tilde{y}^n) \in \mathcal{F}_n(l,k,x^n,y^n)$; then $l \le n-\Delta$ because $\tilde{x}^{n-\Delta} \ne x^{n-\Delta}$. By the definition of $i_x$, we know that $H_S(l,k,\tilde{x}^n,\tilde{y}^n) \le H_S(l,k,x^n,y^n)$. Using the union bound argument, we get the desired inequality. ∎

We only need to bound each single error probability $p_n(l,k)$ to finish the proof.
Lemma 6: Upper bound on $p_n(l,k)$, $l \le k$: $\forall\gamma > 0$, $\exists K_1 < \infty$, s.t.

$$p_n(l,k) \le K_1\exp\{-(n-l+1)[E_x(R_x,R_y,\lambda)-\gamma]\}$$

where $\lambda = (k-l)/(n-l+1) \in [0,1]$.

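Proof: Here the error probability $p_n(l,k)$ can be thought of as starting from (54) with the condition $(k-l)H(\tilde{x}_l^{k-1}|y_l^{k-1}) + (n-k+1)H(\tilde{x}_k^n,\tilde{y}_k^n) \le (k-l)H(x_l^{k-1}|y_l^{k-1}) + (n-k+1)H(x_k^n,y_k^n)$ substituted for $p(\tilde{x}_l^n,\tilde{y}_l^n) \ge p(x_l^n,y_l^n)$. We get

\begin{align}
p_n(l,k) \le \sum_{P^{n-k},P^{k-l}}\ \sum_{V^{n-k},V^{k-l}}\ \sum_{\substack{y_l^{k-1}\in T_{P^{k-l}}\\ y_k^n\in T_{P^{n-k}}}}\ \sum_{\substack{x_l^{k-1}\in T_{V^{k-l}}(y_l^{k-1})\\ x_k^n\in T_{V^{n-k}}(y_k^n)}} \min\Big[1,\ & \sum_{\substack{\tilde{V}^{n-k},\tilde{V}^{k-l},\tilde{P}^{n-k}\ \text{s.t.}\\ S(\tilde{P}^{n-k},P^{k-l},\tilde{V}^{n-k},\tilde{V}^{k-l}) \le S(P^{n-k},P^{k-l},V^{n-k},V^{k-l})}}\ \sum_{\tilde{y}_k^n\in T_{\tilde{P}^{n-k}}}\ \sum_{\tilde{x}_l^{k-1}\in T_{\tilde{V}^{k-l}}(y_l^{k-1})}\ \sum_{\tilde{x}_k^n\in T_{\tilde{V}^{n-k}}(\tilde{y}_k^n)} \notag\\
& \exp\{-(n-l+1)R_x-(n-k+1)R_y\}\Big]\ p_{xy}(x_l^n,y_l^n) \tag{80}
\end{align}

In (80) we enumerate all the source sequences in a way that allows us to focus on the types of the important subsequences. We enumerate the possibly misleading candidate sequences in terms of their suffix types. We restrict the sum to those pairs $(\tilde{x}^n,\tilde{y}^n)$ that could lead to mistaken decoding, defining the compact notation $S(P^{n-k},P^{k-l},V^{n-k},V^{k-l}) \triangleq (k-l)H(V^{k-l}|P^{k-l}) + (n-k+1)H(P^{n-k}\times V^{n-k})$, which is the weighted suffix entropy condition rewritten in terms of types.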

Note that the summations within the minimization in (80) do not depend on the arguments within these sums. Thus, we can bound this sum separately to get a bound on the number of possibly misleading source pairs $(\tilde{x},\tilde{y})$. Writing $\tilde{S} \le S$ as shorthand for the constraint $S(\tilde{P}^{n-k},P^{k-l},\tilde{V}^{n-k},\tilde{V}^{k-l}) \le S(P^{n-k},P^{k-l},V^{n-k},V^{k-l})$ on the types $\tilde{V}^{n-k},\tilde{V}^{k-l},\tilde{P}^{n-k}$,

\begin{align}
&\sum_{\tilde{S}\le S}\ \sum_{\tilde{y}_k^n\in T_{\tilde{P}^{n-k}}}\ \sum_{\tilde{x}_l^{k-1}\in T_{\tilde{V}^{k-l}}(y_l^{k-1})}\ \sum_{\tilde{x}_k^n\in T_{\tilde{V}^{n-k}}(\tilde{y}_k^n)} 1 \notag\\
&\quad\le \sum_{\tilde{S}\le S}\ \sum_{\tilde{y}_k^n\in T_{\tilde{P}^{n-k}}} |T_{\tilde{V}^{k-l}}(y_l^{k-1})|\,|T_{\tilde{V}^{n-k}}(\tilde{y}_k^n)| \tag{81}\\
&\quad\le \sum_{\tilde{S}\le S} |T_{\tilde{P}^{n-k}}|\exp\{(k-l)H(\tilde{V}^{k-l}|P^{k-l})\}\exp\{(n-k+1)H(\tilde{V}^{n-k}|\tilde{P}^{n-k})\} \tag{82}\\
&\quad\le \sum_{\tilde{S}\le S} \exp\{(k-l)H(\tilde{V}^{k-l}|P^{k-l}) + (n-k+1)H(\tilde{P}^{n-k}\times\tilde{V}^{n-k})\} \tag{83}\\
&\quad\le \sum_{\tilde{S}\le S} \exp\{(k-l)H(V^{k-l}|P^{k-l}) + (n-k+1)H(P^{n-k}\times V^{n-k})\} \tag{84}\\
&\quad\le (n-l+2)^{2|\mathcal{X}||\mathcal{Y}|}\exp\{(k-l)H(V^{k-l}|P^{k-l}) + (n-k+1)H(P^{n-k}\times V^{n-k})\} \tag{85}
\end{align}

In (81) we sum over all $\tilde{x}_l^{k-1} \in T_{\tilde{V}^{k-l}}(y_l^{k-1})$. In (82) we use standard bounds, e.g., $|T_{\tilde{V}^{k-l}}(y_l^{k-1})| \le \exp\{(k-l)H(\tilde{V}^{k-l}|P^{k-l})\}$ since $y_l^{k-1} \in T_{P^{k-l}}$. We also sum over all $\tilde{x}_k^n \in T_{\tilde{V}^{n-k}}(\tilde{y}_k^n)$ and over all $\tilde{y}_k^n \in T_{\tilde{P}^{n-k}}$ in (82). By definition of the decoding rule, $(\tilde{x},\tilde{y})$ can only lead to a decoding error if $(k-l)H(\tilde{V}^{k-l}|P^{k-l}) + (n-k+1)H(\tilde{P}^{n-k}\times\tilde{V}^{n-k}) \le (k-l)H(V^{k-l}|P^{k-l}) + (n-k+1)H(P^{n-k}\times V^{n-k})$. In (85) we apply the polynomial bound on the number of types.

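We substitute (85) into (80) and pull out the polynomial term, giving

\begin{align}
p_n(l,k) &\le (n-l+2)^{2|\mathcal{X}||\mathcal{Y}|} \sum_{P^{n-k},P^{k-l}}\ \sum_{V^{n-k},V^{k-l}}\ \sum_{\substack{y_l^{k-1}\in T_{P^{k-l}}\\ y_k^n\in T_{P^{n-k}}}}\ \sum_{\substack{x_l^{k-1}\in T_{V^{k-l}}(y_l^{k-1})\\ x_k^n\in T_{V^{n-k}}(y_k^n)}} \notag\\
&\qquad \min\big[1,\ \exp\{-(k-l)[R_x-H(V^{k-l}|P^{k-l})]-(n-k+1)[R_x+R_y-H(V^{n-k}\times P^{n-k})]\}\big]\ p_{xy}(x_l^n,y_l^n) \notag\\
&\le (n-l+2)^{2|\mathcal{X}||\mathcal{Y}|} \sum_{P^{n-k},P^{k-l}}\ \sum_{V^{n-k},V^{k-l}} \exp\big\{-\max\big[0,\ (k-l)[R_x-H(V^{k-l}|P^{k-l})]+(n-k+1)[R_x+R_y-H(V^{n-k}\times P^{n-k})]\big]\big\} \notag\\
&\qquad \times\exp\{-(k-l)D(V^{k-l}\times P^{k-l}\|p_{xy})-(n-k+1)D(V^{n-k}\times P^{n-k}\|p_{xy})\} \tag{86}\\
&\le (n-l+2)^{2|\mathcal{X}||\mathcal{Y}|} \sum_{P^{n-k},P^{k-l}}\ \sum_{V^{n-k},V^{k-l}} \exp\big\{-(n-l+1)\big[\lambda D(V^{k-l}\times P^{k-l}\|p_{xy})+\bar\lambda D(V^{n-k}\times P^{n-k}\|p_{xy}) \notag\\
&\qquad +\big|\lambda[R_x-H(V^{k-l}|P^{k-l})]+\bar\lambda[R_x+R_y-H(V^{n-k}\times P^{n-k})]\big|^+\big]\big\} \tag{87}\\
&\le (n-l+2)^{2|\mathcal{X}||\mathcal{Y}|} \sum_{P^{n-k},P^{k-l}}\ \sum_{V^{n-k},V^{k-l}} \exp\big\{-(n-l+1)\inf_{\tilde{x},\tilde{y},\bar{x},\bar{y}}\big[\lambda D(p_{\tilde{x}\tilde{y}}\|p_{xy})+\bar\lambda D(p_{\bar{x}\bar{y}}\|p_{xy}) \notag\\
&\qquad +\big|\lambda[R_x-H(\tilde{x}|\tilde{y})]+\bar\lambda[R_x+R_y-H(\bar{x},\bar{y})]\big|^+\big]\big\} \tag{88}\\
&\le (n-l+2)^{4|\mathcal{X}||\mathcal{Y}|}\exp\{-(n-l+1)E_x(R_x,R_y,\lambda)\} \tag{89}\\
&\le K_1\exp\{-(n-l+1)[E_x(R_x,R_y,\lambda)-\gamma]\} \tag{90}
\end{align}

In (86) we use the memoryless property of the source, and exponential bounds on the probability of observing $(x_l^{k-1},y_l^{k-1})$ and $(x_k^n,y_k^n)$. In (87) we pull out $(n-l+1)$ from all terms, noticing that $\lambda = (k-l)/(n-l+1) \in [0,1]$ and $\bar\lambda \triangleq 1-\lambda = (n-k+1)/(n-l+1)$. In (88) we minimize the exponent over all choices of distributions $p_{\tilde{x}\tilde{y}}$ and $p_{\bar{x}\bar{y}}$. In (89) we define the universal random coding exponent $E_x(R_x,R_y,\lambda) \triangleq \inf_{\tilde{x},\tilde{y},\bar{x},\bar{y}}\{\lambda D(p_{\tilde{x}\tilde{y}}\|p_{xy})+\bar\lambda D(p_{\bar{x}\bar{y}}\|p_{xy})+|\lambda[R_x-H(\tilde{x}|\tilde{y})]+\bar\lambda[R_x+R_y-H(\bar{x},\bar{y})]|^+\}$, where $0 \le \lambda \le 1$ and $\bar\lambda = 1-\lambda$. We also incorporate the number of conditional and marginal types into the polynomial bound, as well as the sum over $k$, and then push the polynomial into the exponent since for any polynomial $F$, $\forall E, \epsilon > 0$, there exists $C > 0$ s.t. $F(\Delta)e^{-\Delta E} \le Ce^{-\Delta(E-\epsilon)}$. ∎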
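The last step, pushing a polynomial into the exponent, can be made explicit. The following worked inequality is a sketch of the standard argument; the particular choice of $C$ is ours and is not stated in the text.

```latex
\[
  F(\Delta)\,e^{-\Delta E}
  \;=\; \bigl(F(\Delta)\,e^{-\Delta\epsilon}\bigr)\,e^{-\Delta(E-\epsilon)}
  \;\le\; C\,e^{-\Delta(E-\epsilon)},
  \qquad
  C \;\triangleq\; \sup_{\Delta \ge 0}\,\bigl|F(\Delta)\bigr|\,e^{-\Delta\epsilon} \;<\; \infty .
\]
```

The supremum is finite because any polynomial grows more slowly than $e^{\Delta\epsilon}$ for every fixed $\epsilon > 0$.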

A similar derivation yields a bound on $p_n(l,k)$ for $l > k$.

Combining Lemmas 6 and 5 and then following the same derivation as for ML decoding yields the theorem.



VIII. Future Directions 

A. Stationary-ergodic sources and universality 

[12] extends the block-coding proofs to the Slepian-Wolf problem for stationary-ergodic sources using AEP 
arguments. To have a similar extension to the streaming context, possibly additional regularity conditions will be 
required so that error exponents can be achieved. To achieve universality over sources, it is possible that further 
technical restrictions will be required. For the case of distributed Markov sources however, it seems quite clear 
that all the arguments in this paper will easily generalize. In that case, following the approach we take in [13], 
the source can be "segmented" into small blocks and the endpoints$^{15}$ of the blocks can be encoded perfectly at
essentially zero rate. Conditioned on these endpoints, the blocks are then iid, with the endpoints representing a 
third stream of perfectly known side-information. 

$^{15}$ For a Markov source of known order $k$, the endpoint is just $k$ successive symbols at the end of the block.

B. Upper bounds and demonstrating optimal delays

This paper dealt entirely with achievability of certain error exponents. Ideally, we would have corresponding 
upper bounds demonstrating that no higher exponents are possible. In the block-coding case, problem 3.7.1 in [8] 
provides a simple upper-bound. However, the nature of the error exponents in the streaming case might be more 
complicated. [6] provides an upper bound and matching achievable scheme for point-to-point source-coding with 
delay and this bound extends naturally to the case where side-information is known at both the encoder and the 
decoder. [14] provides an upper bound for the case of side-information known only at the decoder, and this bound 
is tight for certain symmetric cases. However, both of these extend single-encoder arguments from [15] that do not immediately generalize to the case of multiple encoders.

C. Trading off error exponents for the different source terminals 

For multiple terminal systems, different error exponents can be achieved for different users or sources. For channel 
coding, the encoders can choose different distributions while generating the randomized code book to achieve an 
error exponent trade-off among different users. In [16], the error exponent region is studied for the Gaussian 
multiple access channel and the broadcast channel within the block-coding paradigm. It is unclear whether similar 
tradeoffs are possible within the streaming Slepian-Wolf problems considered here since there is nothing immediately
comparable to the flexibility we have in choosing the "input distribution" for channel coding problems. 

D. Adaptation and limited feedback 

An interesting extension is to adaptive universal streaming Slepian-Wolf encoders. The decoders we use in this paper are based on empirical statistics. Therefore they can be used even if source statistics are unknown. The current proposal will work regardless of source and side information statistics as long as the conditional entropy $H(x|y)$ is less than the encoding rate. Even if there is uncertainty in statistics, the anytime nature of the coding system should enable the system to adapt on-line to the unknown entropy rate if some feedback channel is available. The feedback channel would be used to order increases (or decreases) in the binning rate. An increase (or decrease) could be triggered by examining the difference between two quantities: the minimal empirical joint entropy between the decoded sequence and observation, and the empirical joint entropy between the particular sequence and observation yielding the second-lowest joint entropy. If there is a large difference between these two entropies, we are using rate excessively, and the rate of communication can be reduced. If the difference is negligible, then it is likely we are not decoding correctly. Our target should be to keep this difference at roughly $\epsilon$. In the current context, this is analogous to the rate margin by which we choose to exceed the known conditional entropy.
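The feedback rule described above can be sketched as a simple controller. Everything concrete here is an assumption made for illustration: the function name, the fixed step size, and the thresholds (twice and half the target gap) are hypothetical choices, since the text only says the gap should be kept at roughly $\epsilon$.

```python
def adapt_rate(rate, best_entropy, second_entropy, target_gap, step=0.05):
    """Return an updated binning rate from the gap between the two lowest joint entropies.

    best_entropy   -- minimal empirical joint entropy (decoded sequence vs. observation)
    second_entropy -- second-lowest empirical joint entropy among candidates
    target_gap     -- desired gap (the epsilon of the text)
    The 2x / 0.5x thresholds and the fixed step are illustrative assumptions.
    """
    gap = second_entropy - best_entropy
    if gap > 2.0 * target_gap:       # large margin: rate is being used excessively
        return max(rate - step, 0.0)
    if gap < 0.5 * target_gap:       # negligible margin: decoding is likely unreliable
        return rate + step
    return rate                      # gap is roughly on target: keep the rate
```

A dead band around the target avoids oscillating the rate on every feedback message; whether such hysteresis is appropriate would depend on the feedback channel, which the text leaves open.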
Appendix

In this section we show that the maximum likelihood (ML) error exponent equals the universal error exponent. We show that for all $\gamma$,

$$E_x^{ML}(R_x,R_y,\gamma) = E_x^{UN}(R_x,R_y,\gamma),$$

where the ML error exponent is

\begin{align*}
E_x^{ML}(R_x,R_y,\gamma) &= \sup_{\rho\in[0,1]}\{\gamma E_{x|y}(R_x,\rho) + (1-\gamma)E_{xy}(R_x,R_y,\rho)\}\\
&= \sup_{\rho\in[0,1]}\Big\{\rho R^{(\gamma)} - \gamma\log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big) - (1-\gamma)(1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)\Big\}\\
&= \sup_{\rho\in[0,1]}\{E_x^{ML}(R_x,R_y,\gamma,\rho)\}.
\end{align*}

Write the function inside the sup argument as $E_x^{ML}(R_x,R_y,\gamma,\rho)$. The universal error exponent is

\begin{align*}
E_x^{UN}(R_x,R_y,\gamma) &= \inf_{q_{xy},o_{xy}}\{\gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \max\{0,\ \gamma(R_x-H(q_{x|y})) + (1-\gamma)(R_x+R_y-H(o_{xy}))\}\}\\
&= \inf_{q_{xy},o_{xy}}\{\gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \max\{0,\ R^{(\gamma)}-\gamma H(q_{x|y})-(1-\gamma)H(o_{xy})\}\}.
\end{align*}

Here we define $R^{(\gamma)} = \gamma R_x + (1-\gamma)(R_x+R_y) > \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$. For notational simplicity, we write $q_{xy}$ and $o_{xy}$ as two arbitrary joint distributions on $\mathcal{X}\times\mathcal{Y}$ instead of $p_{\tilde{x}\tilde{y}}$ and $p_{\bar{x}\bar{y}}$. We still write $p_{xy}$ as the distribution of the source.

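Before the proof, we define a pair of distributions that we will need.

Definition 4: Tilted distribution of $p_{xy}$: $p^\rho_{xy}$, for all $\rho \in [-1,\infty)$,

$$p^\rho_{xy}(x,y) = \frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}}.$$

The entropy of the tilted distribution is written as $H(p^\rho_{xy})$. Obviously $p^0_{xy} = p_{xy}$.

Definition 5: $x-y$ tilted distribution of $p_{xy}$: $\bar{p}^\rho_{xy}$, for all $\rho \in [-1,+\infty)$,

$$\bar{p}^\rho_{xy}(x,y) = \frac{\big[\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}\big]^{1+\rho}}{\sum_t\big[\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\big]^{1+\rho}} \times \frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}} = \frac{A(y,\rho)}{B(\rho)}\times\frac{C(x,y,\rho)}{D(y,\rho)},$$

where

\begin{align*}
A(y,\rho) &= \Big[\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}\Big]^{1+\rho} = D(y,\rho)^{1+\rho}\\
B(\rho) &= \sum_t\Big[\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big]^{1+\rho} = \sum_y A(y,\rho)\\
C(x,y,\rho) &= p_{xy}(x,y)^{\frac{1}{1+\rho}}\\
D(y,\rho) &= \sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}} = \sum_x C(x,y,\rho).
\end{align*}

The marginal distribution for $y$ is $\frac{A(y,\rho)}{B(\rho)}$. Obviously $\bar{p}^0_{xy} = p_{xy}$. Write the conditional distribution of $x$ given $y$ under distribution $\bar{p}^\rho_{xy}$ as $\bar{p}^\rho_{x|y}$, where $\bar{p}^\rho_{x|y}(x,y) = \frac{C(x,y,\rho)}{D(y,\rho)}$, and the conditional entropy of $x$ given $y$ under distribution $\bar{p}^\rho_{xy}$ as $H(\bar{p}^\rho_{x|y})$. Obviously $H(\bar{p}^0_{x|y}) = H(p_{x|y})$.

The conditional entropy of $x$ given $y$ for the $x-y$ tilted distribution is

$$H(\bar{p}^\rho_{x|y=y}) = -\sum_x\frac{C(x,y,\rho)}{D(y,\rho)}\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big).$$

We introduce $A(y,\rho)$, $B(\rho)$, $C(x,y,\rho)$, $D(y,\rho)$ to simplify the notation. Some of their properties are shown in Lemma 10.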

While tilted distributions are common optimal distributions in large deviation theory, it is useful to contemplate why we need to introduce these two tilted distributions. In the proof of Theorem 8, through a Lagrange multiplier argument, we will show that $\{p^\rho_{xy} : \rho \in [-1,+\infty)\}$ is the family of distributions that minimize the Kullback-Leibler distance to $p_{xy}$ with fixed entropy and $\{\bar{p}^\rho_{xy} : \rho \in [-1,+\infty)\}$ is the family of distributions that minimize the Kullback-Leibler distance to $p_{xy}$ with fixed conditional entropy. Using a Lagrange multiplier argument, we parametrize the universal error exponent $E_x^{UN}(R_x,R_y,\gamma)$ in terms of $\rho$ and show the equivalence of the universal and maximum likelihood error exponents.
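Both tilted families are easy to compute for a small alphabet. The sketch below follows Definitions 4 and 5 using the quantities $A$, $B$, $C$, $D$; the dictionary representation, the function names, and the example source `P` are assumptions made for illustration only.

```python
import math

def tilted(p, rho):
    """Standard tilted distribution: p(x,y)^(1/(1+rho)), normalized over all (x,y)."""
    a = 1.0 / (1.0 + rho)
    z = sum(v ** a for v in p.values())
    return {xy: v ** a / z for xy, v in p.items()}

def xy_tilted(p, rho):
    """x-y tilted distribution A(y,rho)/B(rho) * C(x,y,rho)/D(y,rho)."""
    a = 1.0 / (1.0 + rho)
    C = {xy: v ** a for xy, v in p.items()}
    D = {}
    for (x, y), c in C.items():
        D[y] = D.get(y, 0.0) + c
    A = {y: d ** (1.0 + rho) for y, d in D.items()}
    B = sum(A.values())
    return {(x, y): A[y] / B * c / D[y] for (x, y), c in C.items()}

def entropy(q):
    """Entropy in nats of a distribution given as a dict of probabilities."""
    return -sum(v * math.log(v) for v in q.values() if v > 0)

def cond_entropy_x_given_y(q):
    """H(x|y) = H(x,y) - H(y) for a joint distribution keyed by (x, y)."""
    qy = {}
    for (x, y), v in q.items():
        qy[y] = qy.get(y, 0.0) + v
    return entropy(q) - entropy(qy)

# Illustrative joint source on {0,1} x {0,1}; not taken from the paper.
P = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}
```

At $\rho = 0$ both constructions return the source itself, and increasing $\rho$ flattens the distribution, which is the monotonicity that Lemmas 7 and 11 establish for the entropy and conditional entropy.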
Now we are ready to prove Theorem 8: $E_x^{ML}(R_x,R_y,\gamma) = E_x^{UN}(R_x,R_y,\gamma)$.
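Proof:

A. Case 1: $\gamma H(p_{x|y}) + (1-\gamma)H(p_{xy}) < R^{(\gamma)} < \gamma H(\bar{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})$.

First, from Lemma 16 and Lemma 17,

$$\frac{\partial E_x^{ML}(R_x,R_y,\gamma,\rho)}{\partial\rho} = R^{(\gamma)} - \gamma H(\bar{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy}).$$

Then, using Lemma 7 and Lemma 11, we have

$$\frac{\partial^2 E_x^{ML}(R_x,R_y,\gamma,\rho)}{\partial\rho^2} \le 0.$$

So $\rho$ maximizes $E_x^{ML}(R_x,R_y,\gamma,\rho)$ if and only if

$$0 = \frac{\partial E_x^{ML}(R_x,R_y,\gamma,\rho)}{\partial\rho} = R^{(\gamma)} - \gamma H(\bar{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy}). \tag{91}$$

Because $R^{(\gamma)}$ is in the interval $[\gamma H(p_{x|y}) + (1-\gamma)H(p_{xy}),\ \gamma H(\bar{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})]$ and the entropy functions increase monotonically in $\rho$, we can find $\rho^* \in (0,1)$ s.t.

$$\gamma H(\bar{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy}) = R^{(\gamma)}.$$

Using Lemma 14 and Lemma 15 we get

$$E_x^{ML}(R_x,R_y,\gamma) = \gamma D(\bar{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}), \tag{92}$$

where $\gamma H(\bar{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy}) = R^{(\gamma)}$; $\rho^*$ is generally unique because both $H(\bar{p}^\rho_{x|y})$ and $H(p^\rho_{xy})$ are strictly increasing with $\rho$.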
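Equation (92) can be sanity-checked numerically for a small source. The sketch below is an illustration under assumptions: the source `P`, the weight `GAMMA`, the choice of rate in the middle of the case 1 interval, and all function names are hypothetical. It solves the stationarity condition (91) by bisection and compares the parametric form (92) with a brute-force maximization of $E_x^{ML}(R_x,R_y,\gamma,\rho)$ over a grid.

```python
import math

P = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}  # illustrative source (assumed)
GAMMA = 0.4                                               # illustrative gamma (assumed)

def pieces(rho):
    """Return B(rho), Z(rho) = sum p^(1/(1+rho)), the x-y tilted q and the tilted o."""
    a = 1.0 / (1.0 + rho)
    C = {xy: v ** a for xy, v in P.items()}
    D = {}
    for (x, y), c in C.items():
        D[y] = D.get(y, 0.0) + c
    B = sum(d ** (1.0 + rho) for d in D.values())
    Z = sum(C.values())
    q = {(x, y): D[y] ** (1.0 + rho) / B * c / D[y] for (x, y), c in C.items()}
    o = {xy: c / Z for xy, c in C.items()}
    return B, Z, q, o

def kl(q):
    """Kullback-Leibler divergence D(q || P) in nats."""
    return sum(v * math.log(v / P[xy]) for xy, v in q.items())

def weighted_entropy(rho):
    """gamma * H(q_{x|y}) + (1 - gamma) * H(o_xy) at tilting parameter rho."""
    _, _, q, o = pieces(rho)
    qy = {}
    for (x, y), v in q.items():
        qy[y] = qy.get(y, 0.0) + v
    h_cond = -sum(v * math.log(v / qy[y]) for (x, y), v in q.items())
    h_joint = -sum(v * math.log(v) for v in o.values())
    return GAMMA * h_cond + (1.0 - GAMMA) * h_joint

def e_ml(rate, rho):
    """The function inside the sup defining the ML exponent."""
    B, Z, _, _ = pieces(rho)
    return rho * rate - GAMMA * math.log(B) - (1.0 - GAMMA) * (1.0 + rho) * math.log(Z)

def rho_star(rate, iters=200):
    """Bisection on [0, 1] for the rho solving weighted_entropy(rho) = rate."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weighted_entropy(mid) < rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

RATE = 0.5 * (weighted_entropy(0.0) + weighted_entropy(1.0))  # inside the case 1 interval
RS = rho_star(RATE)
_, _, Q, O = pieces(RS)
PARAMETRIC = GAMMA * kl(Q) + (1.0 - GAMMA) * kl(O)            # right-hand side of (92)
GRID_MAX = max(e_ml(RATE, i / 1000.0) for i in range(1001))   # brute-force sup over rho
```

Bisection is valid here only because the weighted entropy is monotone in $\rho$ (Lemmas 7 and 11); the grid maximum is a coarse cross-check, not part of the proof.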

Secondly,

\begin{align}
E_x^{UN}(R_x,R_y,\gamma)
&= \inf_{q_{xy},o_{xy}}\{\gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \max\{0,\ R^{(\gamma)}-\gamma H(q_{x|y})-(1-\gamma)H(o_{xy})\}\} \notag\\
&= \inf_b\Big\{\inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\{\gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \max(0,\ R^{(\gamma)}-b)\}\Big\} \notag\\
&= \inf_{b\ \ge\ \gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\Big\{\inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\{\gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \max(0,\ R^{(\gamma)}-b)\}\Big\} \tag{93}
\end{align}

The last equality is true because, for $b < \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy}) < R^{(\gamma)}$,

\begin{align*}
&\inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\{\gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \max(0,\ R^{(\gamma)}-b)\}\\
&\quad\ge 0 + R^{(\gamma)} - b\\
&\quad= \inf_{q_{xy},o_{xy}:\ H(q_{x|y})=H(p_{x|y}),\ H(o_{xy})=H(p_{xy})}\{\gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \max(0,\ R^{(\gamma)}-b)\}\\
&\quad\ge \inf_{q_{xy},o_{xy}:\ H(q_{x|y})=H(p_{x|y}),\ H(o_{xy})=H(p_{xy})}\{\gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \max(0,\ R^{(\gamma)}-\gamma H(p_{x|y})-(1-\gamma)H(p_{xy}))\}\\
&\quad\ge \inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=\gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\{\gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \max(0,\ R^{(\gamma)}-\gamma H(p_{x|y})-(1-\gamma)H(p_{xy}))\}.
\end{align*}

Fixing $b \ge \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$, the inner infimum in (93) is an optimization problem on $q_{xy}$, $o_{xy}$ with equality constraints $\sum_x\sum_y q_{xy}(x,y) = 1$, $\sum_x\sum_y o_{xy}(x,y) = 1$ and $\gamma H(q_{x|y}) + (1-\gamma)H(o_{xy}) = b$, and the obvious inequality constraints $0 \le q_{xy}(x,y) \le 1$, $0 \le o_{xy}(x,y) \le 1$, $\forall x,y$. In the following formulation of the optimization problem, we relax one equality constraint to an inequality constraint $\gamma H(q_{x|y}) + (1-\gamma)H(o_{xy}) \ge b$ to make the optimization problem convex. It turns out later that the optimal solution to the relaxed problem is also the optimal solution to the original problem because $b \ge \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$. The resulting optimization problem is:

\begin{align}
\inf_{q_{xy},o_{xy}}\ & \{\gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy})\} \notag\\
\text{s.t.}\ & \sum_x\sum_y q_{xy}(x,y) = 1 \notag\\
& \sum_x\sum_y o_{xy}(x,y) = 1 \notag\\
& b - \gamma H(q_{x|y}) - (1-\gamma)H(o_{xy}) \le 0 \notag\\
& 0 \le q_{xy}(x,y) \le 1,\ \forall(x,y) \in \mathcal{X}\times\mathcal{Y} \notag\\
& 0 \le o_{xy}(x,y) \le 1,\ \forall(x,y) \in \mathcal{X}\times\mathcal{Y} \tag{94}
\end{align}

The above optimization problem is convex because the objective function and the inequality constraint functions are convex and the equality constraint functions are affine [17]. The Lagrange multiplier function for this convex optimization problem is:

\begin{align}
L(q_{xy},o_{xy},\rho,\mu_1,\mu_2,\nu_1,\nu_2,\nu_3,\nu_4)
&= \gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) \notag\\
&\quad + \mu_1\Big(\sum_x\sum_y q_{xy}(x,y) - 1\Big) + \mu_2\Big(\sum_x\sum_y o_{xy}(x,y) - 1\Big) \notag\\
&\quad + \rho\big(b - \gamma H(q_{x|y}) - (1-\gamma)H(o_{xy})\big) \notag\\
&\quad + \sum_x\sum_y\{\nu_1(x,y)(-q_{xy}(x,y)) + \nu_2(x,y)(1-q_{xy}(x,y)) + \nu_3(x,y)(-o_{xy}(x,y)) + \nu_4(x,y)(1-o_{xy}(x,y))\} \tag{95}
\end{align}

where $\rho,\mu_1,\mu_2$ are real numbers and $\nu_i \in \mathbb{R}^{|\mathcal{X}||\mathcal{Y}|}$, $i = 1,2,3,4$.

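According to the KKT conditions for convex optimization [17], $q_{xy}$, $o_{xy}$ minimize the convex optimization problem in (94) if and only if the following conditions are simultaneously satisfied for some $q_{xy}$, $o_{xy}$, $\mu_1$, $\mu_2$, $\nu_1$, $\nu_2$, $\nu_3$, $\nu_4$ and $\rho$:

\begin{align}
0 &= \frac{\partial L(q_{xy},o_{xy},\rho,\mu_1,\mu_2,\nu_1,\nu_2,\nu_3,\nu_4)}{\partial q_{xy}(x,y)} \notag\\
&= \gamma\Big[-\log(p_{xy}(x,y)) + (1+\rho)(1+\log(q_{xy}(x,y))) + \rho\log\Big(\sum_s q_{xy}(s,y)\Big)\Big] + \mu_1 - \nu_1(x,y) - \nu_2(x,y) \notag\\
0 &= \frac{\partial L(q_{xy},o_{xy},\rho,\mu_1,\mu_2,\nu_1,\nu_2,\nu_3,\nu_4)}{\partial o_{xy}(x,y)} \notag\\
&= (1-\gamma)\big[-\log(p_{xy}(x,y)) + (1+\rho)(1+\log(o_{xy}(x,y)))\big] + \mu_2 - \nu_3(x,y) - \nu_4(x,y) \tag{96}
\end{align}

for all $x,y$, and

\begin{align}
& \sum_x\sum_y q_{xy}(x,y) = 1 \notag\\
& \sum_x\sum_y o_{xy}(x,y) = 1 \notag\\
& \rho\big(\gamma H(q_{x|y}) + (1-\gamma)H(o_{xy}) - b\big) = 0 \notag\\
& \rho \ge 0 \notag\\
& \nu_1(x,y)(-q_{xy}(x,y)) = 0,\quad \nu_2(x,y)(1-q_{xy}(x,y)) = 0\quad \forall x,y \notag\\
& \nu_3(x,y)(-o_{xy}(x,y)) = 0,\quad \nu_4(x,y)(1-o_{xy}(x,y)) = 0\quad \forall x,y \notag\\
& \nu_i(x,y) \ge 0,\quad \forall x,y,\ i = 1,2,3,4 \tag{97}
\end{align}

Solving the above standard Lagrange multiplier equations (96) and (97), we have:

\begin{align}
q_{xy}(x,y) &= \frac{\big[\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho_b}}\big]^{1+\rho_b}}{\sum_t\big[\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho_b}}\big]^{1+\rho_b}}\cdot\frac{p_{xy}(x,y)^{\frac{1}{1+\rho_b}}}{\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho_b}}} = \bar{p}^{\rho_b}_{xy}(x,y) \notag\\
o_{xy}(x,y) &= \frac{p_{xy}(x,y)^{\frac{1}{1+\rho_b}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho_b}}} = p^{\rho_b}_{xy}(x,y) \notag\\
\nu_i(x,y) &= 0\quad \forall x,y,\ i = 1,2,3,4 \notag\\
\rho &= \rho_b \tag{98}
\end{align}

where $\rho_b$ satisfies the following condition

$$\gamma H(\bar{p}^{\rho_b}_{x|y}) + (1-\gamma)H(p^{\rho_b}_{xy}) = b \ge \gamma H(p_{x|y}) + (1-\gamma)H(p_{xy})$$

and thus $\rho_b \ge 0$ because both $H(\bar{p}^\rho_{x|y})$ and $H(p^\rho_{xy})$ are monotonically increasing with $\rho$, as shown in Lemma 7 and Lemma 11.

Notice that all the KKT conditions are simultaneously satisfied with the inequality constraint $\gamma H(q_{x|y}) + (1-\gamma)H(o_{xy}) \ge b$ being met with equality. Thus, the relaxed optimization problem has the same optimal solution as the original problem, as promised. The optimal $q_{xy}$ and $o_{xy}$ are the $x-y$ tilted distribution $\bar{p}^{\rho_b}_{xy}$ and the standard tilted distribution $p^{\rho_b}_{xy}$ of $p_{xy}$ with the same parameter $\rho_b \ge 0$, chosen s.t.

$$\gamma H(\bar{p}^{\rho_b}_{x|y}) + (1-\gamma)H(p^{\rho_b}_{xy}) = b.$$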

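Now we have:

\begin{align}
E_x^{UN}(R_x,R_y,\gamma)
&= \inf_{b\ \ge\ \gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\Big\{\inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\{\gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \max(0,\ R^{(\gamma)}-b)\}\Big\} \notag\\
&= \inf_{b\ \ge\ \gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\{\gamma D(\bar{p}^{\rho_b}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho_b}_{xy}\|p_{xy}) + \max(0,\ R^{(\gamma)}-b)\} \notag\\
&= \min\Big[\inf_{\rho\ge 0:\ R^{(\gamma)}\ \ge\ \gamma H(\bar{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\{\gamma D(\bar{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R^{(\gamma)} - \gamma H(\bar{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})\}, \notag\\
&\qquad\quad \inf_{\rho\ge 0:\ R^{(\gamma)}\ \le\ \gamma H(\bar{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\{\gamma D(\bar{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy})\}\Big] \tag{99}
\end{align}

Notice that $H(p^\rho_{xy})$, $H(\bar{p}^\rho_{x|y})$, $D(p^\rho_{xy}\|p_{xy})$ and $D(\bar{p}^\rho_{xy}\|p_{xy})$ are all strictly increasing with $\rho > 0$, as shown in Lemmas 7, 8, 11 and 12 later in this appendix. We have:

\begin{align}
\inf_{\rho\ge 0:\ R^{(\gamma)}\ \le\ \gamma H(\bar{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\{\gamma D(\bar{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy})\}
= \gamma D(\bar{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) \tag{100}
\end{align}

where $R^{(\gamma)} = \gamma H(\bar{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy})$. Applying the results in Lemma 13 and Lemma 9, we get:

\begin{align}
&\inf_{\rho\ge 0:\ R^{(\gamma)}\ \ge\ \gamma H(\bar{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\{\gamma D(\bar{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R^{(\gamma)} - \gamma H(\bar{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})\} \notag\\
&\quad= \big[\gamma D(\bar{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R^{(\gamma)} - \gamma H(\bar{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})\big]\Big|_{\rho=\rho^*} \notag\\
&\quad= \gamma D(\bar{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) \tag{101}
\end{align}

This is true because for $\rho$ such that $R^{(\gamma)} \ge \gamma H(\bar{p}^\rho_{x|y}) + (1-\gamma)H(p^\rho_{xy})$, we know $\rho \le 1$ because of the range of $R^{(\gamma)}$: $R^{(\gamma)} < \gamma H(\bar{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})$. Substituting (100) and (101) into (99), we get

\begin{align}
E_x^{UN}(R_x,R_y,\gamma) &= \gamma D(\bar{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) \notag\\
\text{where}\quad R^{(\gamma)} &= \gamma H(\bar{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy}). \tag{102}
\end{align}

So for $\gamma H(p_{x|y}) + (1-\gamma)H(p_{xy}) \le R^{(\gamma)} \le \gamma H(\bar{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})$, from (92) we have the desired property:

$$E_x^{ML}(R_x,R_y,\gamma) = E_x^{UN}(R_x,R_y,\gamma).$$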

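B. Case 2: $R^{(\gamma)} \ge \gamma H(\bar{p}^1_{x|y}) + (1-\gamma)H(p^1_{xy})$.

In this case, for all $0 \le \rho \le 1$,

$$\frac{\partial E_x^{ML}(R_x,R_y,\gamma,\rho)}{\partial\rho} = R^{(\gamma)} - \gamma H(\bar{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy}) \ge R^{(\gamma)} - \gamma H(\bar{p}^1_{x|y}) - (1-\gamma)H(p^1_{xy}) \ge 0.$$

So $\rho$ takes value 1 to maximize the error exponent $E_x^{ML}(R_x,R_y,\gamma,\rho)$, thus

$$E_x^{ML}(R_x,R_y,\gamma) = R^{(\gamma)} - \gamma\log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{2}}\Big)^2\Big) - 2(1-\gamma)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{2}}\Big). \tag{103}$$

Using the same convex optimization techniques as in case A, we notice the fact that $\rho^* \ge 1$ for $R^{(\gamma)} = \gamma H(\bar{p}^{\rho^*}_{x|y}) + (1-\gamma)H(p^{\rho^*}_{xy})$. Then applying Lemma 13 and Lemma 9, we have

\begin{align*}
&\inf_{\rho\ge 0:\ R^{(\gamma)}\ \ge\ \gamma H(\bar{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\{\gamma D(\bar{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R^{(\gamma)} - \gamma H(\bar{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})\}\\
&\quad= \gamma D(\bar{p}^1_{xy}\|p_{xy}) + (1-\gamma)D(p^1_{xy}\|p_{xy}) + R^{(\gamma)} - \gamma H(\bar{p}^1_{x|y}) - (1-\gamma)H(p^1_{xy}),
\end{align*}

and

\begin{align*}
&\inf_{\rho\ge 0:\ R^{(\gamma)}\ \le\ \gamma H(\bar{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\{\gamma D(\bar{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy})\}\\
&\quad= \gamma D(\bar{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy})\\
&\quad= \gamma D(\bar{p}^{\rho^*}_{xy}\|p_{xy}) + (1-\gamma)D(p^{\rho^*}_{xy}\|p_{xy}) + R^{(\gamma)} - \gamma H(\bar{p}^{\rho^*}_{x|y}) - (1-\gamma)H(p^{\rho^*}_{xy})\\
&\quad\ge \gamma D(\bar{p}^1_{xy}\|p_{xy}) + (1-\gamma)D(p^1_{xy}\|p_{xy}) + R^{(\gamma)} - \gamma H(\bar{p}^1_{x|y}) - (1-\gamma)H(p^1_{xy}).
\end{align*}

Finally:

\begin{align}
E_x^{UN}(R_x,R_y,\gamma)
&= \inf_{b\ \ge\ \gamma H(p_{x|y})+(1-\gamma)H(p_{xy})}\Big\{\inf_{q_{xy},o_{xy}:\ \gamma H(q_{x|y})+(1-\gamma)H(o_{xy})=b}\{\gamma D(q_{xy}\|p_{xy}) + (1-\gamma)D(o_{xy}\|p_{xy}) + \max(0,\ R^{(\gamma)}-b)\}\Big\} \notag\\
&= \min\Big[\inf_{\rho\ge 0:\ R^{(\gamma)}\ \ge\ \gamma H(\bar{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\{\gamma D(\bar{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy}) + R^{(\gamma)} - \gamma H(\bar{p}^\rho_{x|y}) - (1-\gamma)H(p^\rho_{xy})\}, \notag\\
&\qquad\quad \inf_{\rho\ge 0:\ R^{(\gamma)}\ \le\ \gamma H(\bar{p}^\rho_{x|y})+(1-\gamma)H(p^\rho_{xy})}\{\gamma D(\bar{p}^\rho_{xy}\|p_{xy}) + (1-\gamma)D(p^\rho_{xy}\|p_{xy})\}\Big] \notag\\
&= \gamma D(\bar{p}^1_{xy}\|p_{xy}) + (1-\gamma)D(p^1_{xy}\|p_{xy}) + R^{(\gamma)} - \gamma H(\bar{p}^1_{x|y}) - (1-\gamma)H(p^1_{xy}) \notag\\
&= R^{(\gamma)} - \gamma\log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{2}}\Big)^2\Big) - 2(1-\gamma)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{2}}\Big) \tag{104}
\end{align}

The last equality is true by setting $\rho = 1$ in Lemma 14 and Lemma 15.

Again, $E_x^{ML}(R_x,R_y,\gamma) = E_x^{UN}(R_x,R_y,\gamma)$, thus we finish the proof. ∎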



C. Technical Lemmas 

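Some technical lemmas we used in the above proof of Theorem 8 are now discussed:

Lemma 7: $\frac{\partial H(p^\rho_{xy})}{\partial\rho} \ge 0$.

Proof: From the definition of the tilted distribution we have the following observation:

$$\log(p^\rho_{xy}(x_1,y_1)) - \log(p^\rho_{xy}(x_2,y_2)) = \log(p_{xy}(x_1,y_1)^{\frac{1}{1+\rho}}) - \log(p_{xy}(x_2,y_2)^{\frac{1}{1+\rho}}).$$

Using the above equality, we first derive the derivative of the tilted distribution, for all $x,y$:

\begin{align}
\frac{\partial p^\rho_{xy}(x,y)}{\partial\rho}
&= \frac{-1}{(1+\rho)^2}\,\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}\log(p_{xy}(x,y))\big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)}{\big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)^2} \notag\\
&\quad - \frac{-1}{(1+\rho)^2}\,\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}\big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\log(p_{xy}(s,t))\big)}{\big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)^2} \notag\\
&= \frac{-1}{1+\rho}\,p^\rho_{xy}(x,y)\Big[\log(p_{xy}(x,y)^{\frac{1}{1+\rho}}) - \sum_t\sum_s p^\rho_{xy}(s,t)\log(p_{xy}(s,t)^{\frac{1}{1+\rho}})\Big] \notag\\
&= \frac{-1}{1+\rho}\,p^\rho_{xy}(x,y)\Big[\log(p^\rho_{xy}(x,y)) - \sum_t\sum_s p^\rho_{xy}(s,t)\log(p^\rho_{xy}(s,t))\Big] \notag\\
&= \frac{-1}{1+\rho}\,p^\rho_{xy}(x,y)\big[\log(p^\rho_{xy}(x,y)) + H(p^\rho_{xy})\big] \tag{105}
\end{align}

Then:

\begin{align}
\frac{\partial H(p^\rho_{xy})}{\partial\rho}
&= -\frac{\partial\sum_{x,y}p^\rho_{xy}(x,y)\log(p^\rho_{xy}(x,y))}{\partial\rho} \notag\\
&= -\sum_{x,y}\big(1+\log(p^\rho_{xy}(x,y))\big)\frac{\partial p^\rho_{xy}(x,y)}{\partial\rho} \notag\\
&= \sum_{x,y}\big(1+\log(p^\rho_{xy}(x,y))\big)\frac{1}{1+\rho}\,p^\rho_{xy}(x,y)\big(\log(p^\rho_{xy}(x,y)) + H(p^\rho_{xy})\big) \notag\\
&= \frac{1}{1+\rho}\sum_{x,y}p^\rho_{xy}(x,y)\log(p^\rho_{xy}(x,y))\big(\log(p^\rho_{xy}(x,y)) + H(p^\rho_{xy})\big) \notag\\
&= \frac{1}{1+\rho}\Big[\sum_{x,y}p^\rho_{xy}(x,y)\big(\log(p^\rho_{xy}(x,y))\big)^2\sum_{x,y}p^\rho_{xy}(x,y) - H(p^\rho_{xy})^2\Big] \notag\\
&\ge_{(a)} \frac{1}{1+\rho}\Big[\Big(\sum_{x,y}p^\rho_{xy}(x,y)\log(p^\rho_{xy}(x,y))\Big)^2 - H(p^\rho_{xy})^2\Big] \notag\\
&= 0 \tag{106}
\end{align}

where (a) is true by the Cauchy-Schwarz inequality. ∎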

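Lemma 8: $\frac{\partial D(p^\rho_{xy}\|p_{xy})}{\partial\rho} = \rho\,\frac{\partial H(p^\rho_{xy})}{\partial\rho}$.

Proof: As shown in Lemma 14 and Lemma 16 respectively:

\begin{align*}
D(p^\rho_{xy}\|p_{xy}) &= \rho H(p^\rho_{xy}) - (1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)\\
H(p^\rho_{xy}) &= \frac{\partial(1+\rho)\log\big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)}{\partial\rho}
\end{align*}

We have:

\begin{align*}
\frac{\partial D(p^\rho_{xy}\|p_{xy})}{\partial\rho}
&= H(p^\rho_{xy}) + \rho\,\frac{\partial H(p^\rho_{xy})}{\partial\rho} - \frac{\partial(1+\rho)\log\big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)}{\partial\rho}\\
&= \rho\,\frac{\partial H(p^\rho_{xy})}{\partial\rho}
\end{align*}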

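Lemma 9: $\mathrm{sign}\,\frac{\partial[D(p^\rho_{xy}\|p_{xy}) - H(p^\rho_{xy})]}{\partial\rho} = \mathrm{sign}(\rho-1)$.

Proof: Combining the results of the previous two lemmas, we have:

$$\frac{\partial[D(p^\rho_{xy}\|p_{xy}) - H(p^\rho_{xy})]}{\partial\rho} = (\rho-1)\frac{\partial H(p^\rho_{xy})}{\partial\rho}, \tag{107}$$

whose sign is $\mathrm{sign}(\rho-1)$.

Lemma 10: Properties of $\frac{\partial A(y,\rho)}{\partial\rho}$, $\frac{\partial B(\rho)}{\partial\rho}$, $\frac{\partial C(x,y,\rho)}{\partial\rho}$, $\frac{\partial D(y,\rho)}{\partial\rho}$ and $\frac{\partial H(\bar{p}^\rho_{x|y=y})}{\partial\rho}$.

First,

$$\frac{\partial C(x,y,\rho)}{\partial\rho} = \frac{\partial p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\partial\rho} = -\frac{1}{1+\rho}\,p_{xy}(x,y)^{\frac{1}{1+\rho}}\log(p_{xy}(x,y)^{\frac{1}{1+\rho}}) = -\frac{C(x,y,\rho)}{1+\rho}\log(C(x,y,\rho)).$$

So

$$\frac{\partial D(y,\rho)}{\partial\rho} = \frac{\partial\sum_s p_{xy}(s,y)^{\frac{1}{1+\rho}}}{\partial\rho} = -\frac{\sum_x C(x,y,\rho)\log(C(x,y,\rho))}{1+\rho}.$$

For a differentiable function $f(\rho)$,

$$\frac{\partial f(\rho)^{1+\rho}}{\partial\rho} = f(\rho)^{1+\rho}\log(f(\rho)) + (1+\rho)f(\rho)^\rho\frac{\partial f(\rho)}{\partial\rho}.$$

So

\begin{align}
\frac{\partial A(y,\rho)}{\partial\rho} = \frac{\partial D(y,\rho)^{1+\rho}}{\partial\rho}
&= D(y,\rho)^{1+\rho}\log(D(y,\rho)) + (1+\rho)D(y,\rho)^\rho\frac{\partial D(y,\rho)}{\partial\rho} \notag\\
&= D(y,\rho)^{1+\rho}\Big(\log(D(y,\rho)) - \sum_x\frac{C(x,y,\rho)}{D(y,\rho)}\log(C(x,y,\rho))\Big) \notag\\
&= A(y,\rho)H(\bar{p}^\rho_{x|y=y}) \tag{108}
\end{align}

$$\frac{\partial B(\rho)}{\partial\rho} = \sum_y\frac{\partial A(y,\rho)}{\partial\rho} = \sum_y A(y,\rho)H(\bar{p}^\rho_{x|y=y}) = B(\rho)\sum_y\frac{A(y,\rho)}{B(\rho)}H(\bar{p}^\rho_{x|y=y}) = B(\rho)H(\bar{p}^\rho_{x|y})$$

And last:

\begin{align}
\frac{\partial H(\bar{p}^\rho_{x|y=y})}{\partial\rho}
&= -\sum_x\Big[\frac{\frac{\partial C(x,y,\rho)}{\partial\rho}}{D(y,\rho)} - \frac{C(x,y,\rho)\frac{\partial D(y,\rho)}{\partial\rho}}{D(y,\rho)^2}\Big]\Big[1+\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)\Big] \notag\\
&= -\sum_x\Big[\frac{-\frac{C(x,y,\rho)}{1+\rho}\log(C(x,y,\rho))}{D(y,\rho)} + \frac{C(x,y,\rho)\sum_s C(s,y,\rho)\log(C(s,y,\rho))}{(1+\rho)D(y,\rho)^2}\Big]\Big[1+\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)\Big] \notag\\
&= \frac{1}{1+\rho}\sum_x\Big[\bar{p}^\rho_{x|y}(x,y)\log(C(x,y,\rho)) - \bar{p}^\rho_{x|y}(x,y)\sum_s\bar{p}^\rho_{x|y}(s,y)\log(C(s,y,\rho))\Big]\big[1+\log(\bar{p}^\rho_{x|y}(x,y))\big] \notag\\
&= \frac{1}{1+\rho}\sum_x\bar{p}^\rho_{x|y}(x,y)\Big[\log(\bar{p}^\rho_{x|y}(x,y)) - \sum_s\bar{p}^\rho_{x|y}(s,y)\log(\bar{p}^\rho_{x|y}(s,y))\Big]\big[1+\log(\bar{p}^\rho_{x|y}(x,y))\big] \notag\\
&= \frac{1}{1+\rho}\sum_x\bar{p}^\rho_{x|y}(x,y)\log(\bar{p}^\rho_{x|y}(x,y))\Big[\log(\bar{p}^\rho_{x|y}(x,y)) - \sum_s\bar{p}^\rho_{x|y}(s,y)\log(\bar{p}^\rho_{x|y}(s,y))\Big] \notag\\
&= \frac{1}{1+\rho}\sum_x\bar{p}^\rho_{x|y}(x,y)\log(\bar{p}^\rho_{x|y}(x,y))\log(\bar{p}^\rho_{x|y}(x,y)) - \frac{1}{1+\rho}\Big[\sum_x\bar{p}^\rho_{x|y}(x,y)\log(\bar{p}^\rho_{x|y}(x,y))\Big]^2 \notag\\
&\ge 0 \tag{109}
\end{align}

The inequality is true by the Cauchy-Schwarz inequality and by noticing that $\sum_x\bar{p}^\rho_{x|y}(x,y) = 1$. ∎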
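These properties will again be used in the proofs of the following lemmas.

Lemma 11: $\frac{\partial H(\bar{p}^\rho_{x|y})}{\partial\rho} \ge 0$.

Proof:

\begin{align*}
\frac{\partial}{\partial\rho}\frac{A(y,\rho)}{B(\rho)}
&= \frac{1}{B(\rho)^2}\Big(\frac{\partial A(y,\rho)}{\partial\rho}B(\rho) - A(y,\rho)\frac{\partial B(\rho)}{\partial\rho}\Big)\\
&= \frac{1}{B(\rho)^2}\big(A(y,\rho)H(\bar{p}^\rho_{x|y=y})B(\rho) - H(\bar{p}^\rho_{x|y})B(\rho)A(y,\rho)\big)\\
&= \frac{A(y,\rho)}{B(\rho)}\big(H(\bar{p}^\rho_{x|y=y}) - H(\bar{p}^\rho_{x|y})\big)
\end{align*}

Now,

\begin{align}
\frac{\partial H(\bar{p}^\rho_{x|y})}{\partial\rho}
&= \frac{\partial}{\partial\rho}\sum_y\frac{A(y,\rho)}{B(\rho)}\sum_x\frac{C(x,y,\rho)}{D(y,\rho)}\Big[-\log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big)\Big] \notag\\
&= \frac{\partial}{\partial\rho}\sum_y\frac{A(y,\rho)}{B(\rho)}H(\bar{p}^\rho_{x|y=y}) \notag\\
&= \sum_y\frac{A(y,\rho)}{B(\rho)}\frac{\partial H(\bar{p}^\rho_{x|y=y})}{\partial\rho} + \sum_y\frac{\partial}{\partial\rho}\Big(\frac{A(y,\rho)}{B(\rho)}\Big)H(\bar{p}^\rho_{x|y=y}) \notag\\
&\ge \sum_y\frac{A(y,\rho)}{B(\rho)}\big(H(\bar{p}^\rho_{x|y=y}) - H(\bar{p}^\rho_{x|y})\big)H(\bar{p}^\rho_{x|y=y}) \notag\\
&= \sum_y\frac{A(y,\rho)}{B(\rho)}H(\bar{p}^\rho_{x|y=y})^2 - H(\bar{p}^\rho_{x|y})^2 \notag\\
&\ge_{(a)} \Big(\sum_y\frac{A(y,\rho)}{B(\rho)}H(\bar{p}^\rho_{x|y=y})\Big)^2 - H(\bar{p}^\rho_{x|y})^2 \notag\\
&= 0 \tag{110}
\end{align}

where the first inequality uses Lemma 10 and (a) is again true by the Cauchy-Schwarz inequality. ∎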

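Lemma 12: $\frac{\partial D(\bar{p}^\rho_{xy}\|p_{xy})}{\partial\rho} = \rho\,\frac{\partial H(\bar{p}^\rho_{x|y})}{\partial\rho}$.

Proof: As shown in Lemma 15 and Lemma 17 respectively:

\begin{align*}
D(\bar{p}^\rho_{xy}\|p_{xy}) &= \rho H(\bar{p}^\rho_{x|y}) - \log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big)\\
H(\bar{p}^\rho_{x|y}) &= \frac{\partial\log\big(\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\big)}{\partial\rho}
\end{align*}

We have:

\begin{align*}
\frac{\partial D(\bar{p}^\rho_{xy}\|p_{xy})}{\partial\rho}
&= H(\bar{p}^\rho_{x|y}) + \rho\,\frac{\partial H(\bar{p}^\rho_{x|y})}{\partial\rho} - \frac{\partial\log\big(\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\big)}{\partial\rho}\\
&= \rho\,\frac{\partial H(\bar{p}^\rho_{x|y})}{\partial\rho}
\end{align*}

Lemma 13: $\mathrm{sign}\,\frac{\partial[D(\bar{p}^\rho_{xy}\|p_{xy}) - H(\bar{p}^\rho_{x|y})]}{\partial\rho} = \mathrm{sign}(\rho-1)$.

Proof: Using the previous lemma, we get:

$$\frac{\partial[D(\bar{p}^\rho_{xy}\|p_{xy}) - H(\bar{p}^\rho_{x|y})]}{\partial\rho} = (\rho-1)\frac{\partial H(\bar{p}^\rho_{x|y})}{\partial\rho}$$

Then by Lemma 11 we get the conclusion.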
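Lemma 14:

$$\rho H(p^\rho_{xy}) - (1+\rho)\log\Big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big) = D(p^\rho_{xy}\|p_{xy})$$

Proof: By noticing that $\log(p_{xy}(x,y)) = (1+\rho)\big[\log(p^\rho_{xy}(x,y)) + \log\big(\sum_{s,t}p_{xy}(s,t)^{\frac{1}{1+\rho}}\big)\big]$, we have:

\begin{align}
D(p^\rho_{xy}\|p_{xy})
&= -H(p^\rho_{xy}) - \sum_{x,y}p^\rho_{xy}(x,y)\log(p_{xy}(x,y)) \notag\\
&= -H(p^\rho_{xy}) - \sum_{x,y}p^\rho_{xy}(x,y)(1+\rho)\Big[\log(p^\rho_{xy}(x,y)) + \log\Big(\sum_{s,t}p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big)\Big] \notag\\
&= -H(p^\rho_{xy}) + (1+\rho)H(p^\rho_{xy}) - (1+\rho)\sum_{x,y}p^\rho_{xy}(x,y)\log\Big(\sum_{s,t}p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big) \notag\\
&= \rho H(p^\rho_{xy}) - (1+\rho)\log\Big(\sum_{s,t}p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big) \tag{112}
\end{align}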



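Lemma 15:

$$D(\bar{p}^\rho_{xy}\|p_{xy}) = \rho H(\bar{p}^\rho_{x|y}) - \log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big)$$

Proof:

\begin{align*}
D(\bar{p}^\rho_{xy}\|p_{xy})
&= \sum_y\sum_x\frac{A(y,\rho)}{B(\rho)}\frac{C(x,y,\rho)}{D(y,\rho)}\log\Big(\frac{\frac{A(y,\rho)}{B(\rho)}\frac{C(x,y,\rho)}{D(y,\rho)}}{p_{xy}(x,y)}\Big)\\
&= \sum_y\sum_x\frac{A(y,\rho)}{B(\rho)}\frac{C(x,y,\rho)}{D(y,\rho)}\Big[\log\Big(\frac{A(y,\rho)}{B(\rho)}\Big) + \log\Big(\frac{C(x,y,\rho)}{D(y,\rho)}\Big) - \log(p_{xy}(x,y))\Big]\\
&= -\log(B(\rho)) - H(\bar{p}^\rho_{x|y}) + \sum_y\sum_x\frac{A(y,\rho)}{B(\rho)}\frac{C(x,y,\rho)}{D(y,\rho)}(1+\rho)\big[\log(D(y,\rho)) - \log(C(x,y,\rho))\big]\\
&= -\log(B(\rho)) - H(\bar{p}^\rho_{x|y}) + (1+\rho)H(\bar{p}^\rho_{x|y})\\
&= -\log\Big(\sum_y\Big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\Big)^{1+\rho}\Big) + \rho H(\bar{p}^\rho_{x|y})
\end{align*}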

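Lemma 16:

$$H(p^\rho_{xy}) = \frac{\partial(1+\rho)\log\big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)}{\partial\rho}$$

Proof:

\begin{align}
\frac{\partial(1+\rho)\log\big(\sum_y\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)}{\partial\rho}
&= \log\Big(\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}\Big) - \sum_y\sum_x\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}}\log\big(p_{xy}(x,y)^{\frac{1}{1+\rho}}\big) \notag\\
&= -\sum_y\sum_x p^\rho_{xy}(x,y)\log\Big(\frac{p_{xy}(x,y)^{\frac{1}{1+\rho}}}{\sum_t\sum_s p_{xy}(s,t)^{\frac{1}{1+\rho}}}\Big) \notag\\
&= H(p^\rho_{xy}) \tag{113}
\end{align}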
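Lemmas 14 and 16 are closed-form identities, so they can be checked numerically on any small source. The sketch below is illustrative only: the source `P`, the tested values of $\rho$, and the function names are assumptions. Lemma 14 is checked directly, and Lemma 16 is checked against a central finite difference.

```python
import math

P = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}  # illustrative source (assumed)

def log_partition(rho):
    """(1 + rho) * log(sum over (x,y) of p(x,y)^(1/(1+rho)))."""
    return (1.0 + rho) * math.log(sum(v ** (1.0 / (1.0 + rho)) for v in P.values()))

def tilted(rho):
    """Standard tilted distribution of P with parameter rho."""
    a = 1.0 / (1.0 + rho)
    z = sum(v ** a for v in P.values())
    return {xy: v ** a / z for xy, v in P.items()}

def entropy(q):
    """Entropy in nats."""
    return -sum(v * math.log(v) for v in q.values())

def kl(q):
    """Kullback-Leibler divergence D(q || P) in nats."""
    return sum(v * math.log(v / P[xy]) for xy, v in q.items())

def lemma14_gap(rho):
    """D(p^rho || p) minus [rho * H(p^rho) - (1 + rho) * log(sum p^(1/(1+rho)))]."""
    q = tilted(rho)
    return kl(q) - (rho * entropy(q) - log_partition(rho))

def lemma16_gap(rho, h=1e-6):
    """Central-difference derivative of log_partition minus H(p^rho)."""
    derivative = (log_partition(rho + h) - log_partition(rho - h)) / (2.0 * h)
    return derivative - entropy(tilted(rho))
```

Both gaps should vanish up to floating-point and finite-difference error; a nonzero gap would indicate a transcription error in the formulas rather than a failure of the lemmas.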



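Lemma 17:

$$H(\bar{p}^\rho_{x|y}) = \frac{\partial\log\big(\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\big)}{\partial\rho}$$

Proof: Notice that $B(\rho) = \sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}$, and $\frac{\partial B(\rho)}{\partial\rho} = B(\rho)H(\bar{p}^\rho_{x|y})$ as shown in Lemma 10. It is clear that:

\begin{align}
\frac{\partial\log\big(\sum_y\big(\sum_x p_{xy}(x,y)^{\frac{1}{1+\rho}}\big)^{1+\rho}\big)}{\partial\rho}
&= \frac{\partial\log(B(\rho))}{\partial\rho} \notag\\
&= \frac{1}{B(\rho)}\frac{\partial B(\rho)}{\partial\rho} \notag\\
&= H(\bar{p}^\rho_{x|y}) \tag{114}
\end{align}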